Philosophy of Statistics
Handbook of the Philosophy of Science, Volume 7

General Editors
Dov M. Gabbay, Paul Thagard, John Woods

Edited by
Prasanta S. Bandyopadhyay, Montana State University, USA
Malcolm R. Forster, Tsinghua University, China and University of Wisconsin-Madison, USA

North Holland is an imprint of Elsevier
North Holland is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

First edition 2011
Copyright © 2011 Elsevier B.V. All rights reserved

ISBN: 978-0-444-51862-0
ISSN: 0031-8019
In memory of M. A. Sattar (PSB)
In loving memory of my parents Lyn Forster and Ray Forster (MRF)
GENERAL PREFACE
Dov Gabbay, Paul Thagard, and John Woods

Whenever science operates at the cutting edge of what is known, it invariably runs into philosophical issues about the nature of knowledge and reality. Scientific controversies raise such questions as the relation of theory and experiment, the nature of explanation, and the extent to which science can approximate to the truth. Within particular sciences, special concerns arise about what exists and how it can be known: for example, in physics about the nature of space and time, and in psychology about the nature of consciousness. Hence the philosophy of science is an essential part of the scientific investigation of the world.

In recent decades, philosophy of science has become an increasingly central part of philosophy in general. Although there are still philosophers who think that theories of knowledge and reality can be developed by pure reflection, much current philosophical work finds it necessary and valuable to take into account relevant scientific findings. For example, the philosophy of mind is now closely tied to empirical psychology, and political theory often intersects with economics. Thus philosophy of science provides a valuable bridge between philosophical and scientific inquiry.

More and more, the philosophy of science concerns itself not just with general issues about the nature and validity of science, but especially with particular issues that arise in specific sciences. Accordingly, we have organized this Handbook into many volumes reflecting the full range of current research in the philosophy of science. We invited volume editors who are fully involved in the specific sciences, and are delighted that they have solicited contributions by scientifically-informed philosophers and (in a few cases) philosophically-informed scientists. The result is the most comprehensive review ever provided of the philosophy of science. Here are the volumes in the Handbook:

Philosophy of Science: Focal Issues, edited by Theo Kuipers.
Philosophy of Physics, edited by Jeremy Butterfield and John Earman.
Philosophy of Biology, edited by Mohan Matthen and Christopher Stephens.
Philosophy of Mathematics, edited by Andrew Irvine.
Philosophy of Logic, edited by Dale Jacquette.
Philosophy of Chemistry and Pharmacology, edited by Andrea Woody, Robin Hendry and Paul Needham.
Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm Forster.
Philosophy of Information, edited by Pieter Adriaans and Johan van Benthem.
Philosophy of Technology and Engineering Sciences, edited by Anthonie Meijers.
Philosophy of Complex Systems, edited by Cliff Hooker.
Philosophy of Ecology, edited by Bryson Brown, Kent A. Peacock and Kevin deLaplante.
Philosophy of Psychology and Cognitive Science, edited by Paul Thagard.
Philosophy of Economics, edited by Uskali Mäki.
Philosophy of Linguistics, edited by Ruth Kempson, Tim Fernando and Nicholas Asher.
Philosophy of Anthropology and Sociology, edited by Stephen Turner and Mark Risjord.
Philosophy of Medicine, edited by Fred Gifford.

Details about the contents and publishing schedule of the volumes can be found at http://www.elsevier.com/wps/find/bookdescription.cws_home/BSHPHS/description#description

As general editors, we are extremely grateful to the volume editors for arranging such a distinguished array of contributors and for managing their contributions. Production of these volumes has been a huge enterprise, and our warmest thanks go to Jane Spurr and Carol Woods for putting them together. Thanks also to Lauren Schultz and Derek Coleman at Elsevier for their support and direction.
PREFACE
Prasanta S. Bandyopadhyay and Malcolm R. Forster

"Nowadays," complained the American statistician William Kruskal more than four decades ago, "writers calling themselves statisticians and those calling themselves philosophers of science often refer to each [other], but [their] communication is restricted and piecemeal."1

1 W. Kruskal and J. Tanur, eds. International Encyclopedia of Statistics, Vol. 2, Free Press, Collier Macmillan Publishers, London, 1968, p. 1082.

This volume aims to remedy these shortcomings, which, unfortunately, have continued to plague the disciplines. It provides state-of-the-art research in the area of philosophy of statistics by encouraging numerous experts to communicate with one another without feeling "restricted" by their disciplines or thinking "piecemeal" in their treatment of issues. A second goal of this book is to present work in the field without bias toward any particular statistical paradigm.

Broadly speaking, the essays in this Handbook are concerned with problems of induction, statistics and probability. For centuries, foundational problems like induction have been among philosophers' favorite topics; recently, however, non-philosophers have increasingly taken a keen interest in these issues. This volume accordingly contains papers by both philosophers and non-philosophers, including scholars from nine academic disciplines. In addition, while the Handbook deals primarily with Anglo-American analyses and data, it also includes various approaches by many non-Western authors. The contributors hail from more than ten countries spread over four continents, providing the volume with a valuable global dimension. Statistics, of course, affects virtually every aspect of human life everywhere, from Spain to China to Mauritius. Appreciating the power of statistics, many of the authors have utilized data far from their places of origin, and, to the book's benefit, they are not hesitant to exploit this information to present their own interpretations of complex issues. Moreover, just as statistics has a theoretical side, so it also has an applied side. Several chapters in this volume combine those two aspects. Additionally, we provide essays that are straightforwardly historical and address both Western and non-Western views on probability, thus making available the ramifications of statistics and probability in different cultures.

A few words are in order regarding the assistance we received from other scholars. We relied on an international team of experts in preparing the papers offered here. The referees were extremely generous with their time, advice and suggestions. James Hawthorne did far more than his share of reading and offering valuable suggestions. Clark Glymour, Theodore Porter, Jeff Paris, Antony Eagle, David Hitchcock, Prabal K. Sen, Jose Miguel Ponciano, Mylavarapu Deekshithulu,
Jay Herson, Bijoy Mukherjee, and Oliver Schulte also deserve high praise. Without the help of these scholars, the Handbook would not have taken its present shape, with such variety, complexity, depth and scope.

While writing and revising the Introduction, we received valuable help from the essays' authors regarding whether we correctly represented their views. Our drafts were shuttled back and forth between the authors and the editors until both were satisfied. The process enabled us to clarify our comments on their essays. We believe this makes the Introduction more representative of how both the authors and we the editors understand the contributors' arguments and contributions.

Although our Handbook is substantial in terms of its variety, scope, complexity, and depth, it is not designed to provide an all-encompassing encyclopedia of statistics. We were unable to include some significant topics like experimental design, the identifiability problem in statistics, bootstrapping, and cross-validation. We hope that readers will still find it useful and significant in terms of its contribution to the numerous fields which apply and use statistics in a reflective manner.

The question remains whether scholars have been able to communicate more effectively across disciplinary divides since the publication of Kruskal's paper four decades ago. Some scholars have mixed views about the merit of this sort of interaction. One foremost philosopher of science noted that, compared to twenty-five years ago, today's philosophers of science are considerably more dependent on knowing much more of the actual science. By implication, modern philosophy of science should be interdisciplinary, and that is good for the well-being of the subject. A statistician of great eminence, however, holds a different opinion about the interaction between statisticians and philosophers. Based on his experience, he notes that when statisticians and philosophers meet, they diverge like two logs floating in a river,2 touching one another solely for the purpose of taking two different routes.3 What are we to make of these conflicting perspectives among experts? Ultimately, we leave it to the reader to evaluate whether this volume is able to make some headway in the direction of multi-disciplinary dialogue, and whether this dialogue is worthwhile for the well-being of these overlapping disciplines. However, we firmly believe that the effort at cross-disciplinary communication is a worthwhile endeavor.

2 This log analogy has been borrowed from the Ramayana (the "Ayodhya Section"), one of the epics of the Indian sub-continent.
3 However, he does not like to sound so pessimistic. This came out more as a reaction to his experience of the present state of interactions between the two disciplines rather than what he believes is theoretically possible for a better interaction between philosophy of science and statistics.

It would be inappropriate to conclude the preface without recording our thanks to the people who were involved from its very inception. We could not have produced a volume of this size without leaning on many individuals for guidance and help. We thank John Woods, the series editor, for his guidance and suggestions. We are especially grateful to Jane Spurr, who worked tirelessly toward the completion of this volume and responded patiently to our daily emails. At times, we began to feel the despair noted by
Woody Allen: "eternity is very long, especially towards the end."4 During those trying times, Jane reassured us that the task was almost completed. We are truly indebted to her for both her ceaseless work and her dedication to the production of this volume. We thank A. Philip Dawid, Jayanta K. Ghosh, and Colin Howson for their suggestions regarding the volume, as well as our colleagues and friends for their support, including James Allard, Anindya Bandyopadhyay, Hironmay Banerjee, John G. Bennett, Gordon Brittan, Debiprasad Chattopadyay, David Cherry, Steve Cherry, Abhijit Dasgupta, Simon Dixon, Arthur Falk, Dan Flory, Jack Gilchrist, Sanford Levy, Michelle Maskiell, Sue Monahan, Prodyot K. Mukhopadhyay, Mujib Rahman, Michael Reidy, Abdur Razzaque, Tasneem Sattar, Billy Smith and Mark Taper. We are also thankful to Derek Coleman and his team in Amsterdam and Mohana Natarajan and her team in Chennai for their constant assistance and suggestions with regard to printing style and publication quality. PSB acknowledges the help of his department's three superb administrative assistants, Diane Cattrell, Deidre Manry, and Jessica Marks, for their smiling faces whenever he needed long printouts for the volume. PSB is also thankful to Montana State University's legal team for its expert opinion in spelling out some features of the contract. Last but not least, the generous support PSB received from the Astrobiology Biogeocatalysis Research Center (NASA grant # 4w1781) and a 2007 summer grant from Montana State University's Vice-Provost Office assisted him in carrying out this research. MRF thanks the University of Wisconsin-Madison and the University of Pittsburgh Center for the Philosophy of Science for their support during the spring semester of 2006.

PSB
MRF
25 November, 2010
4 Quoted in Martin Rees’ Just Six Numbers: The Deep Forces that Shape the Universe. Basic Books, Great Britain, 2000, p.71.
CONTRIBUTORS

Davis Baird, Clark University, USA. [email protected]
Prasanta S. Bandyopadhyay, Montana State University, USA. [email protected]
Deborah Bennett, New Jersey City University, USA. [email protected]
José M. Bernardo, Universitat de València, Spain. [email protected]
Jeffrey D. Blume, Vanderbilt University School of Medicine, USA. [email protected]
Robert J. Boik, Montana State University, USA. [email protected]
Arijit Chakrabarti, Indian Statistical Institute, India. [email protected]
Richard Charnigo, University of Kentucky, USA. [email protected]
Steve Cherry, Montana State University, USA. [email protected]
Abhijit Dasgupta, University of Detroit Mercy, USA. [email protected]
A. Philip Dawid, Cambridge University, UK. [email protected]
Michael Dickson, University of South Carolina, USA. [email protected]
David L. Dowe, Monash University, Australia. david~dot~dowe~at~infotech.monash.edu.au
Kenny Easwaran, University of Southern California, USA. [email protected]
Roberto Festa, University of Trieste, Italy. [email protected]
Malcolm R. Forster, Tsinghua University, China and University of Wisconsin-Madison, USA. [email protected]
Jayanta K. Ghosh, Indian Statistical Institute, India and Purdue University, USA. [email protected]
Sander Greenland, University of California, Los Angeles, USA. [email protected]
Mark C. Greenwood, Montana State University, USA. [email protected]
Jason Grossman, Australian National University, Australia. [email protected]
Peter D. Grünwald, University of Leiden, The Netherlands. [email protected]
Alan Hájek, Australian National University, Australia. [email protected]
Gilbert Harman, Princeton University, USA. [email protected]
Joel Harper, University of Montana, USA. [email protected]
James Hawthorne, University of Oklahoma, USA. [email protected]
Colin Howson, University of Toronto, Canada. [email protected]
Kevin T. Kelly, Carnegie Mellon University, USA. [email protected]
Sanjeev Kulkarni, Princeton University, USA. [email protected]
Subhash R. Lele, University of Alberta, Canada. [email protected]
Deborah G. Mayo, Virginia Tech, USA. [email protected]
Johnnie Moore, University of Montana, USA. [email protected]
John D. Norton, University of Pittsburgh, USA. [email protected]
C. K. Raju, Universiti Sains Malaysia, Malaysia. [email protected]
Jan-Willem Romeijn, University of Groningen, The Netherlands. [email protected]
Steven de Rooij, University of Cambridge, UK. [email protected]
Elliott Sober, University of Wisconsin, USA. [email protected]
Aris Spanos, Virginia Tech, USA. [email protected]
Peter Spirtes, Carnegie Mellon University, USA. [email protected]
Cidambi Srinivasan, University of Kentucky, USA. [email protected]
Daniel Steel, Michigan State University, USA. [email protected]
Mark L. Taper, Montana State University, USA. [email protected]
Choh Man Teng, Institute for Human and Machine Cognition, USA. [email protected]
C. Andy Tsao, National Dong Hwa University, Taiwan. [email protected]
Susan Vineberg, Wayne State University, USA. [email protected]
Paul Weirich, University of Missouri, USA. [email protected]
Gregory Wheeler, New University of Lisbon, Portugal. [email protected]
Jon Williamson, University of Kent, UK. [email protected]
Sandy L. Zabell, Northwestern University, USA. [email protected]
CONTENTS

General Preface (Dov Gabbay, Paul Thagard, and John Woods) vii
Preface (Prasanta S. Bandyopadhyay and Malcolm R. Forster) ix
List of Contributors xiii

Introduction
Philosophy of Statistics: An Introduction (Prasanta S. Bandyopadhyay and Malcolm R. Forster) 1

Part I. Probability & Statistics
Elementary Probability and Statistics: A Primer (Prasanta S. Bandyopadhyay and Steve Cherry) 53

Part II. Philosophical Controversies about Conditional Probability
Conditional Probability (Alan Hájek) 99
The Varieties of Conditional Probability (Kenny Easwaran) 137

Part III. Four Paradigms of Statistics

Classical Statistics Paradigm
Error Statistics (Deborah G. Mayo and Aris Spanos) 153
Significance Testing (Michael Dickson and Davis Baird) 199

Bayesian Paradigm
The Bayesian Decision-Theoretic Approach to Statistics (Paul Weirich) 233
Modern Bayesian Inference: Foundations and Objective Methods (José M. Bernardo) 263
Evidential Probability and Objective Bayesian Epistemology (Gregory Wheeler and Jon Williamson) 307
Confirmation Theory (James Hawthorne) 333
Challenges to Bayesian Confirmation Theory (John D. Norton) 391
Bayesianism as a Pure Logic of Inference (Colin Howson) 441
Bayesian Inductive Logic, Verisimilitude, and Statistics (Roberto Festa) 473

Likelihood Paradigm
Likelihood and its Evidential Framework (Jeffrey D. Blume) 493
Evidence, Evidence Functions, and Error Probabilities (Mark L. Taper and Subhash R. Lele) 513

Akaikean Paradigm
AIC Scores as Evidence — a Bayesian Interpretation (Malcolm Forster and Elliott Sober) 535

Part IV: The Likelihood Principle
The Likelihood Principle (Jason Grossman) 553

Part V: Recent Advances in Model Selection
AIC, BIC and Recent Advances in Model Selection (Arijit Chakrabarti and Jayanta K. Ghosh) 583
Posterior Model Probabilities (A. Philip Dawid) 607

Part VI: Attempts to Understand Different Aspects of "Randomness"
Defining Randomness (Deborah Bennett) 633
Mathematical Foundations of Randomness (Abhijit Dasgupta) 641

Part VII: Probabilistic and Statistical Paradoxes
Paradoxes of Probability (Susan Vineberg) 713
Statistical Paradoxes: Take It to The Limit (C. Andy Tsao) 737

Part VIII: Statistics and Inductive Inference
Statistics as Inductive Inference (Jan-Willem Romeijn) 751

Part IX: Various Issues about Causal Inference
Common Cause in Causal Inference (Peter Spirtes) 777
The Logic and Philosophy of Causal Inference: A Statistical Perspective (Sander Greenland) 813

Part X: Some Philosophical Issues Concerning Statistical Learning Theory
Statistical Learning Theory as a Framework for the Philosophy of Induction (Gilbert Harman and Sanjeev Kulkarni) 833
Testability and Statistical Learning Theory (Daniel Steel) 849

Part XI: Different Approaches to Simplicity Related to Inference and Truth
Luckiness and Regret in Minimum Description Length Inference (Steven de Rooij and Peter D. Grünwald) 865
MML, Hybrid Bayesian Network Graphical Models, Statistical Consistency, Invariance and Uniqueness (David L. Dowe) 901
Simplicity, Truth and Probability (Kevin T. Kelly) 983

Part XII: Special Problems in Statistics/Computer Science
Normal Approximations (Robert J. Boik) 1027
Stein's Phenomenon (Richard Charnigo and Cidambi Srinivasan) 1073
Data, Data, Everywhere: Statistical Issues in Data Mining (Choh Man Teng) 1099

Part XIII: An Application of Statistics to Climate Change
An Application of Statistics in Climate Change: Detection of Nonlinear Changes in a Streamflow Timing Measure in the Columbia and Missouri Headwaters (Mark C. Greenwood, Joel Harper and Johnnie Moore) 1121

Part XIV: Historical Approaches to Probability/Statistics
The Subjective and the Objective (Sandy L. Zabell) 1149
Probability in Ancient India (C. K. Raju) 1175

Index 1197
PHILOSOPHY OF STATISTICS: AN INTRODUCTION
Prasanta S. Bandyopadhyay and Malcolm R. Forster
1. PHILOSOPHY, STATISTICS, AND PHILOSOPHY OF STATISTICS
The expression "philosophy of statistics" contains two key terms: "philosophy" and "statistics." Although it is hard to define those terms precisely, they convey some intuitive meanings. For our present purpose, those intuitive meanings are a good place to embark on our journey. Philosophy has a broader scope than the specific sciences. It is concerned with general principles and issues. In contrast, "statistics" is a specific branch of knowledge that, among many other activities, includes addressing reliable ways of gathering data and making inferences based on them. Perhaps the single most important topic in statistics is how to make reliable inferences. As a result, statisticians are interested in knowing which tools to use and what mechanisms to employ in making and correcting our inferences. In this sense, the general problem of statistics is very much like the problem of induction, in which philosophers have long been interested. In fact, statisticians as diverse as Ronald Fisher [1973], Jerzy Neyman [1967] and Bruno de Finetti [1964] characterized the approaches they originated as methods for inductive inference.1,2

Before we begin our discussion, it is worthwhile mentioning a couple of salient features of this volume. It contains thirty-seven new papers written by forty-eight authors coming from several fields of expertise. These include philosophy, statistics, mathematics, computer science, economics, ecology, electrical engineering, epidemiology, and geoscience. In the introduction, we will provide an outline of each paper without trying to offer expert commentary on all of them. Our emphasis on some topics rather than others in the following discussion reflects our own interests and focus, without downplaying the significance of those topics less discussed. We encourage readers to start with the papers that kindle their interest and lie within their own research areas.

1 See [Seidenfeld, 1979] for a close look at Fisher's views on statistical inference.
2 For a clear discussion of de Finetti's view on the connection between subjective degrees of belief and inductive learning, see [Skyrms, 1984].

In the Western world, David Hume [1739] has been credited with formulating the problem of induction in a particularly compelling fashion. The problem of induction arises when one makes an inference about an unobserved body of data
based on an observed body of data.3 However, there is no assurance that the inference in question will be valid, because the next datum we observe may differ from those already gathered. Furthermore, to assume that they will be the same is, according to Hume, to assume what we need to prove. The problem of induction in this sense has very possibly turned out to be an irresolvable problem.

3 Taleb has adopted a distinctively unique approach to the issues concerning induction, especially as they involve the 9/11 event and the Wall Street crash of 2008 [Taleb, 2010].

Instead of addressing the problem of induction in the way Hume has described it, in terms of certainty, we are more often interested in knowing how or whether we would be able to make better inductive inferences in the sense that they are likely to be true most of the time, that is, in terms of reliability. However, to be able to make reliable inferences we still require substantial assumptions about the relationship between the observed data and the unobserved data. One such assumption, which is sometimes called "the uniformity of nature" assumption and was questioned by Hume, is that the future data will be like the past data. In addition, we sometimes also make empirical assumptions about the world. One such assumption is that the world is simple, for example in the sense of being isotropic: there are laws that apply across all points in space and time, or at least across the domain of interest.

For philosophers, this assumption is, in some sense, reminiscent of the assumption involved in the contrast between the green hypothesis (i.e., all emeralds are green) and the grue hypothesis (i.e., all emeralds are grue). An object x is defined to be grue if and only if x is green and was observed before time t, or x is blue and was not observed before t. The hypotheses are equally consistent with current data, even though the hypotheses are different. So, which hypothesis should we choose, given that they are equally supported by the data? The grue-green problem teaches us that we do commonly make (usually unexamined) assumptions about the concepts we can use and the way we can use them in constructing hypotheses. That is, we make (often unarticulated) assumptions about the ways in which the world is uniform or simple.

As there is a quandary over whether simplicity is a legitimate factor in scientific inference, so there is another heated discussion regarding what kinds of assumptions other than the uniformity of nature assumption, and perhaps also simplicity, are warranted. These debates over the correct assumptions and approaches to inductive inference are as rampant in the choice of one's statistical paradigm (for example, classical/error statistics, Bayesianism, likelihoodism or the Akaikean framework, to mention only the most prominent) as they are in the applied approaches to automated inductive inference in computer science.

In philosophy of statistics, we are interested in the foundational questions, including the debates about statistical paradigms regarding which one provides the right direction and method for carrying out statistical inference, if indeed any of them do in general. On one end of the spectrum, we will discuss several major statistical paradigms with their assumptions and viewpoints. On the other end, we also consider broader issues, like the issue of inductive inference, after understanding these statistical paradigms in more detail. We are likewise interested in
specific questions between these two extremes, as well as more modern viewpoints that adopt a "tool-kit" perspective. On that perspective, each paradigm is merely a collection of inferential tools, with limitations and appropriate (as well as inappropriate) domains of application. We will thus consider issues including the following: causal inference in observational studies; recent advances in model selection criteria; foundational questions such as whether one should accept the likelihood principle (LP) and what conditional probability is; the nature of statistical/probabilistic paradoxes; the problems associated with understanding the notion of randomness; the Stein phenomenon; general problems in data mining; and a number of applied and historical issues in probability and statistics.
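Returning for a moment to the grue example above, a tiny sketch can make the underdetermination vivid. The snippet below (in Python; the observation records are hypothetical illustrations of ours, not data from any chapter) checks both hypotheses against a pre-t data set of green emeralds and finds that each fits perfectly.

    # A minimal sketch of the grue-green underdetermination.  The
    # observation records are hypothetical (color, observed_before_t) pairs.
    def fits_green(color, observed_before_t):
        # "All emeralds are green."
        return color == "green"

    def fits_grue(color, observed_before_t):
        # "All emeralds are grue": green if observed before t, blue otherwise.
        return color == ("green" if observed_before_t else "blue")

    # Every emerald in the current (pre-t) data set is green.
    data = [("green", True)] * 5
    print(all(fits_green(*obs) for obs in data))  # True
    print(all(fits_grue(*obs) for obs in data))   # True: the data cannot decide

The data alone cannot choose between the two hypotheses; the choice must come from assumptions about which concepts may figure in our hypotheses, which is exactly the point made above.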
2. FOUR STATISTICAL PARADIGMS
Sometimes different approaches to scientific inference among philosophers, computer scientists, and statisticians stem from their adherence to competing statistical paradigms. Four statistical paradigms will be considered in the following discussion: (i) classical statistics or error statistics, (ii) Bayesian statistics, (iii) likelihood-based statistics, and (iv) Akaike Information Criterion (AIC)-based statistics. How do they differ in their approaches to statistical inference?
2.1 Four statistical paradigms and four types of questions
To address this question, consider two hypotheses: H, representing that a patient suffers from tuberculosis, and ∼H, its denial. Assume that an X-ray, which is administered as a routine test, comes out positive for the patient. Based on this simple scenario, following the work of Richard Royall, one could pose three questions that underlie the epistemological issues at stake within three competing statistical paradigms [Royall, 1997]:

(i) Given the datum, what should we believe and to what degree?
(ii) What does the datum say regarding evidence for H against its alternative?
(iii) Given the datum, what should we do?

The first question we call the belief question, the second the evidence question, and the third the decision question. Royall thinks that Bayesians address the belief question, that classical/error statistics addresses the decision question, and that only the likelihood program addresses the evidence question. Sharpening Royall's evidence question, the AIC framework can be taken to be addressing what we call the prediction question:
(iv) What does the datum tell us about the predictive accuracy of the hypothesis?

We will discuss how the four statistical paradigms revolve around these four types of questions: (i) the belief question, (ii) the evidence question, (iii) the decision question, and (iv) the prediction question.
2.2 Classical statistics/error statistics paradigm
Deborah Mayo and Aris Spanos have put forward an error statistical approach to statistical inference. Unlike Royall, however, they think that error statistics is successful in addressing the evidence question as well as the decision question. Error statistics provides both a tool-kit for doing statistics and a philosophy of science in which probability plays a key role in arriving at reliable inferences and severe tests. Mayo and Spanos have proposed a detailed account of the severity of testing within the error statistical framework. Suppose George measures his weight on a scale on two dates, and considers the hypothesis H that he has gained no more than 3 pounds during that time. If the measured difference is one pound, and the scale is known to be sensitive to the addition of a 0.1 pound potato, then we can say that the hypothesis has survived a severe test, because the measured difference would have been greater if the hypothesis were false and he had gained more than 3 pounds. The correct justification for such an inference is not that it would rarely be wrong in the long run of repetitions of weighing, as on the strict behavioristic interpretation of tests. Instead, they argue, it is because the test had a very high capacity to have detected the falsity of H, and did not. Focusing on a familiar one-sided Normal test of the mean, Mayo and Spanos show how a severity interpretation of tests addresses each of the criticisms often raised against the use of error probabilities in significance tests. Pre-data, error probabilities ensure that a rejection indicates with severity some discrepancy from the null, and that failing to reject the null rules out with severity those alternatives against which the test has high power. Post-data, one can go much further in determining the discrepancies from the null warranted by the actual data in hand. This is the linchpin of their error statistical philosophy of statistics.4

4 We have lumped classical statistics with Fisher's significance testing, following some of the authors of this volume. Historically, however, classical statistics is distinguished from the theory of significance testing. We owe this point of clarification to Sander Greenland.

Taking frequency statistics as crucial for understanding significance tests, Michael Dickson and Davis Baird have discussed how, on numerous occasions in the social science literature, significance tests have been used, misused and abused, without implying that a well-designed significance test may not have any value. Both authors have explored the history of the use of significance tests, including the controversy between Mendelians and Darwinists, in examining Mendel's work from a statistical perspective. In this regard, they discuss how Ronald Fisher, while attempting to reconcile the debates between Mendelians and Darwinists, came to realize that Mendel's report on 600 plants is questionable, since the data
that the report exploited "were too good to be true". One of the issues Dickson and Baird wonder about is how auxiliaries, along with the hypotheses, are tested within the significance test framework. In a theory-testing framework, philosophers are usually concerned with how or when a theory is confirmed or disconfirmed. According to P. Duhem and W. V. Quine, a theory is confirmed or disconfirmed as a whole, along with its auxiliaries and background information. Their view is known, not surprisingly, as the Duhem-Quine thesis. In Duhem's original statement: "[a]n experiment in physics can never condemn an isolated hypothesis but only a whole theoretical group." Dickson and Baird wonder (without offering a solution) how or whether significance testing could contribute to our understanding of drawing inferences from a body of evidence within the context of the Duhem-Quine thesis.
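To make the post-data severity idea discussed above concrete, here is a minimal sketch of the severity calculation for the one-sided Normal test of the mean (in Python; the sample size, standard deviation, and observed mean are hypothetical numbers of ours, not taken from Mayo and Spanos's chapter). For the test of H0: mu <= mu0 against H1: mu > mu0 with known sigma, the severity with which the data warrant the claim "mu > mu1" is the probability of a sample mean no larger than the one observed, computed under mu = mu1.

    from math import erf, sqrt

    def normal_cdf(z):
        # Standard Normal CDF via the error function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def severity(xbar, mu1, sigma, n):
        # Severity for the claim "mu > mu1": the probability of observing a
        # sample mean no larger than xbar if mu were in fact only mu1.
        return normal_cdf((xbar - mu1) / (sigma / sqrt(n)))

    # Hypothetical numbers: n = 25 measurements, known sigma = 2.0,
    # observed mean 1.0, null hypothesis H0: mu <= 0.
    print(round(severity(xbar=1.0, mu1=0.5, sigma=2.0, n=25), 3))  # 0.894

Read in the way Mayo and Spanos suggest: were the claim false (mu <= 0.5), a sample mean smaller than the one observed would have occurred with probability at least 0.894, so the data provide reasonably severe grounds for the claim "mu > 0.5".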
2.3 Bayesian statistics paradigm
It is generally agreed by its supporters and critics alike that Bayesianism5 is currently the dominant view in the philosophy of science. Some statisticians have gone further, conjecturing years ago that Bayesian statistics will be the dominant statistics of the twenty-first century. Whether this claim can be substantiated is beyond the scope of this introduction. However, it is uncontestable that the Bayesian paradigm has been playing a central role in such disciplines as philosophy, statistics, computer science, and even jurisprudence.

5 Howson and Urbach's [2006] book, which is now a classic, provides a clear account of the Bayesian view.

Bayesians are broadly divided into subjective and objective categories. According to all Bayesians, an agent's degrees of belief must satisfy the rules of the probability calculus; otherwise, in accordance with the familiar "Dutch Book" argument, the agent's degrees of belief are incoherent. Subjective Bayesians take this (probabilistic) coherence to be both a necessary and a sufficient condition for the rationality of an agent's beliefs, and then (typically) argue that the beliefs of rational agents will converge over time. The point of scientific inference, and the source of its "objectivity," is to guarantee coherence and ensure convergence. Objective Bayesians, on the other hand, typically insist that while the coherence condition is necessary, it is not also sufficient for the kind of objectivity which scientific methodologies are intended to make possible.

Paul Weirich's paper in this volume focuses primarily on subjective probability. Weirich has developed a Bayesian decision-theoretic approach in which he considers how an agent's beliefs can be revised in light of data. Probabilities represent an agent's degrees of belief. Weirich evaluates several charges against Bayesians. According to one objection he considers, Bayesianism allows an agent's degrees of belief to be anything, as long as they satisfy the probability calculus. Weirich takes the objection to imply that Bayesian subjective probabilities must represent an agent's idiosyncratic beliefs. He has, however, rejected this permissive Bayesianism in favor of his own version. The notion of conditional
probability, on which the principle of conditionalization rests, is central for him. According to this principle, an agent should update her degree of belief in a hypothesis (H) in light of data (D) in accordance with the principle of conditionalization, which says that her degree of belief in H after the data are known is given by the conditional probability P(H|D) = P(H&D)/P(D), assuming that P(D) is not zero. Weirich also evaluates charges brought against the use of the principle of conditionalization. Finally, he compares Bayesian statistical decision theory with classical statistics, concluding his paper with an evaluation of the latter.

One central area of research in the philosophy of science is Bayesian confirmation theory. James Hawthorne takes Bayesian confirmation theory to provide a logic of how evidence distinguishes among competing hypotheses or theories. He argues that it is misleading to identify Bayesian confirmation theory with the subjective account of probability. Rather, any account that represents the degree to which a hypothesis is supported by evidence as a conditional probability of the hypothesis on the evidence, where the probability function involved satisfies the usual probabilistic axioms, will be a Bayesian confirmation theory, regardless of the interpretation of the notion of probability it employs. For, on any such account, Bayes' theorem will express how what hypotheses say about evidence (via the likelihoods) influences the degree to which hypotheses are supported by evidence (via posterior probabilities). Hawthorne argues that the usual subjective interpretation of the probabilistic confirmation function is severely challenged by extended versions of the problem of old evidence. He shows that on the usual subjectivist interpretation even trivial information an agent may learn about an evidence claim may completely undermine the objectivity of the likelihoods. Thus, insofar as the likelihoods are supposed to be objective (or intersubjectively agreed), the confirmation function cannot bear the usual subjectivist reading. Hawthorne does take prior probabilities to depend on plausibility assessments, but argues that such assessments are not merely subjective, and that Bayesian confirmation theory is not severely handicapped by the sort of subjectivity involved in such assessments. He bases the latter claim on a powerful Bayesian convergence result, which he calls the likelihood ratio convergence theorem. This theorem depends only on likelihoods, not on prior probabilities; and it is a weak law of large numbers result that supplies explicit bounds on the rate of convergence. It shows that as evidence increases, it becomes highly likely that the evidential outcomes will be such as to make the likelihood ratios come to strongly favor a true hypothesis over each evidentially distinguishable competitor. Thus, any two confirmation functions (employed by different agents) that agree on likelihoods but differ on prior probabilities for hypotheses (provided the prior for the true hypothesis is not too near 0) will tend to produce likelihood ratios that bring posterior probabilities to converge towards 0 for false hypotheses and towards 1 for the true alternative.6

6 Our readers might wonder if Hawthorne's Likelihood Ratio Convergence Theorem (LRCT) is different from de Finetti's theorem, since in both theorems, in some sense, the swamping of prior probabilities can occur. According to de Finetti's theorem, if two agents are not pigheaded — meaning that neither of them assigns extreme (0 or 1) but divergent probabilities to a hypothesis (event) and that they accept the property of exchangeability for experiments — then as the data become larger and larger, they tend to assign the same probability value to the hypothesis (event) in question. In fact, Hawthorne's LRCT is different from de Finetti's theorem. There are several reasons why they are different. (i) de Finetti's theorem assumes exchangeability, which is weaker than probabilistic independence, but exchangeability also entails that the evidential events are "identically distributed" — i.e., each possible experiment (or observation) in the data stream has the same number of "possible outcomes," and for each such possible outcome of one experiment (or observation) there is a "possible outcome" for each other experiment (or observation) that has the same probability. The version of the LRCT assumes the probabilistic independence of the outcomes relative to each alternative hypothesis (which is stronger than exchangeability in one respect), but does not assume "identically distributed" experiments (or observations). That is, each hypothesis or theory under consideration may contain within it any number of different statistical hypotheses about entirely different kinds of events, and the data stream may draw on such events, to which the various statistical hypotheses within each theory apply. de Finetti's theorem would only apply (for example) to repeated tosses of a single coin (or of multiple coins, but where each coin is weighted in the same way). The theorem implies that regardless of an agent's prior probabilities about how the coin is weighted, the agent's posterior probabilities would "come to agree" with the posterior probabilities of those who started with different priors than the agent in question. By contrast, the LRCT could apply to alternative theories about how "weighting a coin" will influence its chances of coming up heads. Each alternative hypothesis (or theory) about how distributing the weight influences the chances of "heads" will give an alternative (competing) formula for determining chances from weightings. Thus, each alternative hypothesis implies a whole collection of different statistical hypotheses, in which each statistical hypothesis will represent a different way of weighting the coin. So testing two alternative hypotheses of this sort against each other would involve testing one collection of statistical hypotheses (due to various weightings) against an alternative collection of statistical hypotheses (that propose different chances of heads on those same weightings). de Finetti's theorem cannot apply to such hypotheses, because testing hypotheses about how weightings influence chances (using various different weightings of the coins) will not involve events that are all exchangeable with one another. (ii) de Finetti's theorem need not be about testing one scientific hypothesis or theory against another, but the LRCT is about that. The LRCT shows that sufficient evidence is very likely to make the likelihood ratio that compares a false hypothesis (or theory) to a true hypothesis (or theory) very small — thus favoring the true hypothesis. The LRCT itself only depends on likelihoods, not on prior probabilities. But the LRCT does imply that if the prior probability of the true hypothesis isn't too near 0, then the posterior probability of false competitors will be driven to 0 by the diminishing likelihood ratios, and as that happens the posterior probability of the true hypothesis goes towards 1. So the LRCT could be regarded as a "convergence to truth" result in the sense de Finetti's theorem is. The latter shows that a sufficient amount of evidence will yield posterior probabilities that effectively act as though there is an underlying simple statistical hypothesis governing the experiments; where, with regard to that underlying statistical hypothesis, the experiments will be independent and identically distributed. Regardless of the various "prior probability distributions" over the alternative possible simple statistical hypotheses, there will be a "convergence to agreement" result on the best statistical model (the best simple statistical hypothesis) that models the exchangeable (identically distributed) events as though they were independent (identically distributed) events. But de Finetti didn't think of this as "convergence to a true statistical hypothesis"; he seemed to think of it as converging to the best instrumental model. Summing up, the LRCT is more general than de Finetti's theorem in that the former applies to all statistical theories, not just to those that consist of a single simple statistical hypothesis that only accounts for "repetitions of the same kinds of experiments" that have the same (but unknown) statistical distribution. We owe this point to Hawthorne (in an email communication).
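Both the principle of conditionalization and this swamping of the priors can be illustrated with a toy simulation. The sketch below (in Python; the coin biases, priors, and sample size are hypothetical choices of ours, and the example is a toy illustration rather than Hawthorne's theorem itself) repeatedly conditionalizes two sharply divergent priors on the same simulated coin-toss data; as the likelihood ratios accumulate, both posteriors converge toward the true hypothesis.

    # A minimal sketch: conditionalization on coin tosses, and the swamping
    # of two divergent priors as evidence accumulates.  All numbers below
    # are hypothetical illustrations, not taken from the text.
    import random

    random.seed(0)
    TRUE_BIAS, FALSE_BIAS = 0.6, 0.4   # H1 (true) vs H2 (false) chance of heads

    def update(prior_h1, heads):
        # One step of conditionalization: P(H1|D) = P(D|H1)P(H1) / P(D).
        like_h1 = TRUE_BIAS if heads else 1 - TRUE_BIAS
        like_h2 = FALSE_BIAS if heads else 1 - FALSE_BIAS
        joint_h1 = like_h1 * prior_h1
        joint_h2 = like_h2 * (1 - prior_h1)
        return joint_h1 / (joint_h1 + joint_h2)

    posterior_a, posterior_b = 0.9, 0.1   # two agents with divergent priors on H1
    for _ in range(500):
        heads = random.random() < TRUE_BIAS   # data generated by the true hypothesis
        posterior_a = update(posterior_a, heads)
        posterior_b = update(posterior_b, heads)

    print(round(posterior_a, 6), round(posterior_b, 6))  # both near 1.0

After 500 tosses the two posteriors agree to many decimal places, even though the agents began at 0.9 and 0.1: the likelihood ratios have swamped the priors.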
John D. Norton seeks to provide a counterbalance to the now dominant view that Bayesian confirmation theory has succeeded in finding the universal logic that
governs evidence and its inductive bearing in science. He allows that Bayesians have good reasons for optimism. Where many others have failed, their system succeeds in specifying a precise calculus, in explicating the inductive principles of other accounts, and in combining them into a single consistent theory. However, he urges, its dominance arose only recently in the centuries of Bayesian theorizing, and may not last given the persistence of the problems it faces. Many of the problems Norton identifies for Bayesian confirmation theory concern technicalities that our readers may find more or less troubling. In his view, the most serious challenge stems from the Bayesian aspiration to provide a complete account of inductive inference that traces our inductive reasoning back to an initial, neutral state, prior to the incorporation of any evidence. What defeats this aspiration, according to Norton, is the well-known, recalcitrant problem of the priors, recounted in two forms in his chapter. In one form, the problem is that the posterior P(H|D&B), which expresses the inductive support of data D for hypothesis H in conjunction with background information B, is fixed completely by the two "prior" probabilities, P(H&D|B) and P(D|B). If one is a subjectivist and holds that the prior probabilities can be selected at whim, subject only to the axioms of the probability calculus, then, according to Norton, the posterior P(H|D&B) can never be freed of those whims. Or if one is an objectivist and holds that there can be only one correct prior in each specific situation, then, as explained in his chapter, the additivity of a probability measure precludes one assigning truly "informationless" priors. That is for the better, according to Norton, since a truly informationless prior would assign the same value to every contingent proposition in the algebra. The functional dependence of a posterior on the priors would then force all non-trivial posteriors to a single, informationless value. Hence, a Bayesian account can be non-trivial, Norton contends, only if it begins with a rich prior probability distribution whose inductive content is provided by other, non-Bayesian means.

Three papers in the volume explore the possibility that the Bayesian account can be shown to be a form of logic. Colin Howson contends that Bayesianism is a form of deductive logic of inference, while Roberto Festa and Jan-Willem Romeijn contend that Bayesian theory can be cast in the form of inductive inference. To investigate whether the Bayesian account can be regarded as a form of deductive inference, Howson looks briefly at the last three hundred years of scientific inference and then focuses on why he thinks that Bayesian inference should be considered a form of pure logic of inference. Taking into account the debate over whether probabilistic inference can be regarded as a logic of consistency or coherence, he discusses de Finetti's theory of probability, in which de Finetti took the theory of probability to say nothing about the world, treating it instead as a "logic of uncertainty." One motivation for taking Bayesian inference to be a pure logic of inference is Howson's disagreement with Kyburg's distinction between the expression "consistency," applicable to a system which contains no two inconsistent beliefs, and the expression "coherence," applicable to degrees of belief. For Howson, the analogy with deductive logic is between the latter imposing
consistency constraints on truth-evaluations and the rules of probability theory imposing constraints on degrees of belief. The remainder of his paper is devoted to developing and interpreting Bayesian inference as a form of pure logic of inference.

Both Festa and Romeijn regret that in the past century statistics and inductive inference have developed and flourished more or less independently of one another, without clear signs of symbiosis. Festa zooms in on Bayesian statistics and Carnap's theory of inductive probabilities, and shows that, in spite of their different conceptual bases, the methods worked out within the latter are essentially identical to those used within the former. He argues that some concepts and methods of inductive logic may be applied in the rational reconstruction of several statistical notions and procedures. According to him, inductive logic suggests some new methods which can be used for different kinds of statistical inference involving analogical considerations. Finally, Festa shows how a Bayesian version of truth approximation can be developed and integrated into a statistical framework.7

Romeijn also investigates the relationship between statistics and inductive logic. Although inductive logic and statistics have developed separately, Romeijn thinks, like Festa, that it is time to explore the interrelationship between the two. In his paper, he investigates whether it is possible to represent various modes of statistical inference in terms of inductive logic. Romeijn considers three key ideas in statistics to forge the link: (i) Neyman-Pearson hypothesis testing (NPTH), (ii) maximum-likelihood estimation, and (iii) Bayesian statistics. Romeijn shows, using both Carnapian and Bayesian inductive logic, that the last two of these ideas (i.e., maximum-likelihood estimation and Bayesian statistics) can be represented naturally in terms of a non-ampliative inductive logic. In the final section of his chapter, NPTH is joined to Bayesian inductive logic by means of interval-based probabilities over the statistical hypotheses.

As there are subjective Bayesians, so there are objective Bayesians. José Bernardo is one of them. Since many philosophers are not generally aware of Bernardo's work, we will devote a relatively longer discussion to it. Bernardo writes that "[i]t has become standard practice, . . . , to describe as 'objective' any statistical analysis which only depends on the [statistical] model assumed. In this precise sense (and only in this sense) reference analysis is a method to produce 'objective' Bayesian inference" [Bernardo, 2005]. For Bernardo, the reference analysis that he has advocated to promote his brand of objective Bayesianism should be understood in terms of some parametric model of the form M ≡ {P(x|w), x ∈ X, w ∈ Ω}, which describes the conditions under which data have been generated. Here, the data x are assumed to consist of one observation of the random process x ∈ X with probability distribution P(x|w) for some w ∈ Ω. A parametric model is an instance of a statistical model. Bernardo defines θ = θ(w) ∈ Θ to be some vector of interest. All legitimate Bayesian inferences about the value θ are captured in its posterior distribution P(θ|x) ∝ ∫_Λ P(x|θ, λ) P(θ, λ) dλ, provided these inferences are made under an assumed model.
7 For an excellent discussion on the history of inductive probability, see Zabell’s series of papers [2005].
Here, λ is some vector of nuisance parameters, and P(x|θ, λ) is often referred to as the "model." The attraction of this kind of objectivism is its emphasis on "reference analysis," which with the help of statistical tools has made further headway in turning its theme of objectivity into a respectable statistical school within Bayesianism. As Bernardo writes, "[r]eference analysis may be described as a method to derive model-based, non-subjective posteriors, based on information-theoretical ideas, and intended to describe the inferential content of the data for scientific communication" [Bernardo, 1997]. Here, by the "inferential content of the data" he means that the former provides "the basis for a method to derive non-subjective posteriors" (ibid.).

Bernardo's objective Bayesianism consists of the following claims. First, he thinks that the agent's background information should help the investigator build a statistical model, and hence ultimately influence which prior the latter should assign to the model. Therefore, although Bernardo might endorse arriving at a unique probability value as a goal, he does not require that we have a unique probability assignment for all issues at our disposal. He writes, "[t]he analyst is supposed to have a unique (often subjective) prior p(w), independently of the design of the experiment, but the scientific community will presumably be interested in comparing the corresponding analyst's personal posterior with the reference (consensus) posterior associated to the published experimental design" [Bernardo, 2005, p. 29, the first emphasis is ours]. Second, for Bernardo, statistical inference is nothing but a case of deciding among various models/theories, where decision includes, among other things, the utility of acting on the assumption of the model/theory being empirically adequate. Here, the utility of acting on the empirical adequacy of the model/theory in question might involve some loss function [Bernardo and Smith, 1994, p. 69]. In his chapter for this volume, he has developed his version of objective Bayesianism and has addressed several charges raised against his account.

In their joint chapter, Gregory Wheeler and Jon Williamson have combined objective Bayesianism with Kyburg's evidential theory of probability. This form of Bayesianism, or indeed any form of Bayesianism, seems at odds with Kyburg's approach to statistical inference, which rests on his evidential theory of probability. We will consider one of Kyburg's arguments against Bayesianism. Kyburg thinks that we should not regard partial beliefs as "degrees of belief," because (strict) Bayesians (like Savage) are committed to the assumption of a unique probability for a proposition. He discussed interval-based probability as capturing our partial beliefs about uncertainty. Since interval-based probability is not Bayesian, it follows that we are not allowed to treat partial beliefs as degrees of belief. Given this opposition between Kyburg's view of probability and the objective Bayesian view, Wheeler and Williamson have tried to show how the core ideas of both views can be fruitfully accommodated within a single account of scientific inference.

To conclude our discussion of the Bayesian position while keeping in mind Royall's attribution of the belief question to Bayesians: many Bayesians would have mixed feelings about this attribution. To some extent, some of them might consider it
to be inappropriately simple-minded. Howson would agree with this attribution, with the observation that it misses some of the nuances and subtleties of Bayesian theory. He broadly follows de Finetti's line in taking probabilities to be subjective evaluations. These evaluations are usually called "degrees of belief." So to that extent he certainly thinks that there is a central role for degrees of belief, since after all they are what is referred to directly by the probability function. Therefore, according to him, the attribution of the belief question to Bayesians makes some sense. However, he thinks that the main body of Bayesian theory consists in identifying the constraints which should be imposed on these degrees of belief to ensure their consistency/coherence. His paper provides that framework for Bayesianism. Hawthorne might partially disagree with Royall, since his Likelihood Ratio Convergence Theorem shows how different agents could agree in the end even though they could very well start with varying degrees of belief in a theory. Both Weirich and Norton, although they belong to opposing camps insofar as their stances toward Bayesianism are concerned, might agree that Royall's attribution to Bayesians is after all justified. With regard to the prediction question, many Bayesians, including those who work within the confines of confirmation theory, would argue that an account of confirmation that responds to the belief question is able to handle the prediction question as well, since, for Bayesians, the latter is a sub-class of the belief question.
2.4 Likelihood-based statistics paradigm
Another key statistical paradigm is the “likelihood framework,” broadly construed. It stands between Bayesianism and error statistics (i.e., Frequentist hypothesis testing), because it focuses on what is common to both frameworks: the likelihood function. The data influence Bayesian calculations only by way of the likelihood function, and this is also true of most Frequentist procedures. So Likelihoodists argue that one should simply interpret this function alone to understand what the data themselves say. In contrast, most Bayesians combine the likelihood function with a subjective prior probability over the hypotheses of interest, and Frequentists combine the likelihood function with a decision theory understood in terms of two types of errors (see Mayo and Spanos’s chapter for a discussion of those types of errors). The likelihood function is the same no matter which approach is taken, and Likelihoodists have been quick to point out that historic disagreements between Bayesians and Frequentists can be traced back to these additions to the likelihood function. Jeffrey Blume, however, argues that the disagreement between different statistical paradigms is really due to something more fundamental: the lack of an adequate conceptual framework for characterizing statistical evidence. He argues that any statistical paradigm purporting to measure the strength of statistical evidence in data must distinguish between the following three distinct concepts: (i) a measure of the strength of evidence (“How strong is the evidence for one hypothesis over another?”), (ii) the probability that a particular study will generate
misleading evidence (“What is the chance that the study will yield evidence that is misleading?”), and (iii) the probability that observed evidence — the collected data — is misleading (“What is the chance that this final result is wrong or misleading?”). Blume uses the likelihood paradigm to show that these quantities are indeed conceptually and mathematically distinct. However, this framework clearly transcends any single paradigm. All three concepts, he contends, are essential to understanding statistics and its use in science for evaluating the strength of evidence in data for competing theories. One needs to take note of the fact that here “evidence” means “support for one hypothesis over its rival,” without implying “support for a single hypothesis being true” or some other kind of truth-related virtue. Before collecting any data, Blume contends, we should identify the mathematical tool that we will use to measure the evidence at the end of the study, and we should report the tendency for that tool to be mistaken (i.e., to favor a false hypothesis over the true hypothesis). Once data are collected, we use that tool to calculate the strength of the observed evidence. Then, we should report the tendency for those observed data to be mistaken (i.e., the chance that the observed data favor a false hypothesis over the true one). The subtle message is that the statistical properties of the study design are not the statistical properties of the observed data. Blume contends that this common mistake — attributing the properties of the data collection procedure to the data themselves — is partly to blame for those historic disagreements.

Let us continue with our earlier tuberculosis (TB) example to explore how Blume has dealt with these three distinct concepts, especially since they provide philosophers with an opportunity to investigate the notion of evidence from a different angle. Suppose it is known that individuals with TB test positive 73.33% of the time and individuals without TB test positive only 2.85% of the time. As before, let H represent the simple hypothesis that an individual has tuberculosis and ∼H the hypothesis that she does not. These two hypotheses are mutually exclusive and jointly exhaustive. Suppose our patient tests positive on an X-ray. Now let D, for data, represent the positive X-ray result. Blume offers the likelihood ratio (LR) as a measure of the strength of evidence for one hypothesis over another. The LR is an example of the first concept, the first evidential quantity (EQ1):

LR = P(D|H1)/P(D|H2)    (E1)

In our example, H1 = H and H2 = ∼H.
According to the law of likelihood, the data D — a positive X-ray in our example — provide support for H1 over H2 if and only if their likelihood ratio is greater than one. An immediate corollary of (E1) is that there is equal evidential support for both hypotheses only when LR = 1 (likelihood ratios are always positive). Note that in (E1), if 1 < LR ≤ 8, then D is often said to provide weak evidence for H1 against H2, while when LR > 8, D provides fairly strong evidence. This benchmark, discussed in [Royall, 1997], holds the same meaning regardless of the context. In the tuberculosis example, the LR for H1 (presence of the disease)
over H2 (absence of the disease) is 0.7333/0.0285 ≈ 26. So the strength of evidence is strong: the hypothesis that the disease is present is supported by a factor of about 26 over its alternative.

Now that we know how we will measure evidence, we turn our attention to the study design. Blume’s second evidential concept (EQ2), the probability that a particular study design will generate misleading evidence, involves the probability of a future event. It is the probability that an investigator will collect a set of data that will yield an evidence measure in the wrong direction (i.e., that it will support the wrong hypothesis over the true one). To address the probability that a particular study design will generate misleading evidence, we examine the diagnostic properties of an X-ray for detecting TB (Table 1 below).

Table 1. X-ray results

            Disease present    Disease absent
Positive    73.33%             2.85%
Negative    26.67%             97.15%
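To fix ideas, here is a minimal sketch (ours, not Blume’s) of the computation in (E1) together with Royall’s benchmark; the function name and the threshold of 8 simply transcribe the definitions above.

```python
# Sketch: the likelihood ratio (E1) for the TB example, classified by
# Royall's benchmark (LR > 8 counts as fairly strong evidence).

def likelihood_ratio(p_data_given_h1, p_data_given_h2):
    """LR = P(D|H1) / P(D|H2), the first evidential quantity (EQ1)."""
    return p_data_given_h1 / p_data_given_h2

# Diagnostic probabilities from Table 1 for a positive X-ray (D):
p_pos_given_tb = 0.7333      # P(D | disease present)
p_pos_given_no_tb = 0.0285   # P(D | disease absent)

lr = likelihood_ratio(p_pos_given_tb, p_pos_given_no_tb)
print(f"LR = {lr:.1f}")      # approximately 25.7, i.e., about 26

if lr > 8:
    print("fairly strong evidence for H1 over H2")
elif lr > 1:
    print("weak evidence for H1 over H2")
else:
    print("no support for H1 over H2")
```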
The inference that a positive X-ray is evidence that the disease is present is correct regardless of a subject’s true disease status, because LR = P(D|H1)/P(D|H2) = 0.7333/0.0285 ≈ 26 no matter what that status is. However, a positive X-ray can yield evidence that is misleading. That is, it is possible for a non-diseased patient to have a positive X-ray. Table 1 shows that the probability of a positive test, given that the disease is absent, is only 2.85%. So this would be considered a good test, in the sense that misleading positive evidence arises only 2.85% of the time under that hypothesis. This probability, called “the probability of misleading evidence,” is an example of the second evidential concept (EQ2). Of course, we never know whether an observed positive test result is truly misleading, because we never know the underlying disease status of the patient. However, we can determine the tendency for observed positive test results to be misleading, and this may be helpful in judging the reliability of the observed data.

The third key concept (EQ3) involves the tendency for our final observed likelihood ratio to favor the false hypothesis. Unlike our discussion of EQ2 — the probability of observing misleading evidence — in the last paragraph, EQ3 conditions on the observed data set. At this stage, both the data and the likelihood ratio are fixed; it is the true hypothesis that is unknown. Therefore, we are interested in the probability that a hypothesis is true given the observed data. Back in our example, an observed positive result is misleading if and only if the subject in question does not have the disease. That quantity is P(∼H|D). Similarly, one could construe an observed negative test result to be misleading if and only if the subject does have the disease. That quantity is P(H|∼D). Both quantities, P(∼H|D) and P(H|∼D), can be computed using Bayes’ theorem as long as we know the prevalence of tuberculosis in our testing population, P(H), and the study design probabilities, P(D|H) and P(D|∼H).
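Written out — our rendering of the standard form of Bayes’ theorem, not a formula from Blume’s chapter — the first of these quantities is

P(∼H|D) = P(D|∼H)P(∼H) / [P(D|H)P(H) + P(D|∼H)P(∼H)],

and P(H|∼D) is computed analogously, with D replaced by ∼D throughout.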
P(H) is a prior probability, and typically the specification of this probability would be subjective. But in our example it is easy to imagine that this probability is known and would be agreed upon by many experts. Here we will use a 1987 survey in which there were 9.3 cases of tuberculosis per 100,000 population [Pagano and Gauvreau, 2000]. Consequently, P(H) = 0.0093% and P(∼H) = 99.9907%. From an application of Bayes’ theorem we obtain P(∼H|D) = (0.0285/0.0286) ≈ 1 and P(H|∼D) = 0.0025%. Thus the probability that an observed positive test result is misleading is nearly 100%, because the disease is so rare. Nevertheless, our interpretation of the LR as strong evidence that the disease is present is correct. It is wrong to argue that a positive test for TB is evidence that the disease is absent! Blume and Royall [1997] independently provide less extreme examples, but this scenario shows why it is important to distinguish between the first and third evidential quantities (see Blume’s chapter for the full reference to this paper). A positive test result is evidence that the disease is present even though, in this population, a positive test is not all that reliable (this is not the case for all populations, of course, just those with very rare diseases).

One of the problems of invoking prior probabilities, as we know, is the problem of subjectivity. The prior probabilities we have discussed in the diagnostic case study are frequency-based prior probabilities; as a result, the problem of subjectivity does not arise. However, there are often cases in which the invocation of prior probabilities might be required to handle the third concept within the likelihood framework, thus potentially leading to subjectivity. At first this seems to pose a problem. However, Blume shows that, for any prior, the probability that observed data are misleading is driven to zero as the likelihood ratio grows. Large likelihood ratios are misleading less often. Therefore, the prior plays an important role in the starting point for this probability, but the data — acting through the likelihood ratio — will eventually drive this probability to zero. Thus, Blume argues, it is not critical to know the prior, since a large enough likelihood ratio will render the effect of any reasonable prior moot. In any case, we need to be aware of this dependence on the prior, on which the third key concept might rely in some cases. Here, readers are invited to explore how Hawthorne’s Bayesian Likelihood Ratio Convergence Theorem could be compared and contrasted with the result Blume discusses within the likelihood framework.

This third concept, the probability that observed evidence is misleading, helps bridge the gap between the likelihood and Bayesian frameworks: the Bayesian posterior probability is the probability that the observed evidence is misleading. What is missing in the Bayesian framework, Blume implies, are the first two concepts. Exactly what is the Bayesian measure of the strength of evidence, and how often is that measure misleading? Likewise, Blume argues, the frequentists must also provide similar clarity. They cannot use a tail-area probability for both the first and second quantities and then misinterpret that tail-area probability as the third quantity once data are observed.

Taking a cue from Royall’s work on the likelihood framework, and in some sense similar to Blume’s paper, Mark Taper and Subhash Lele have proposed a more
general version of the likelihood framework that they call “evidentialism.” To motivate why they think the likelihood framework is a special case of evidentialism, here is some justification that one of the authors provides elsewhere [Lele, 2004]. Lele defines a class of functions called “evidence functions” to quantify the strength of evidence for one hypothesis over another. He imposes some desiderata on this class of evidence functions, which could be regarded as epistemological conditions. Some of the conditions satisfied by evidence functions are:

1. The translation invariance condition: if one translates an evidence function by adding a constant to it, the quantification of the strength of evidence should remain unaffected by that addition.

2. The scale invariance condition: if one multiplies an evidence function by a constant, the quantification of the strength of evidence should remain unaffected by that constant multiplier.

3. The reparameterization invariance condition: the evidence function must be invariant under reparameterization. That is, if an evidence function Ev1 is reparameterized to Ev2, then Ev1 and Ev2 must provide identical quantifications of the strength of evidence.

4. The invariance under transformations of the data: the evidence function should remain unaffected, insofar as the quantification of strength is concerned, if one uses different measuring units.

Lele adds two more conditions on the evidence function, which he calls “regularity conditions,” to ensure that the probability of strong evidence for the true hypothesis converges to 1 as the sample size increases. He demonstrates that the likelihood ratio is an optimal measure of evidence under those epistemological and regularity conditions, providing a justification for the use of the likelihood ratio as a measure of evidence. Lele believes that showing the optimality of the likelihood ratio amounts to providing necessary and sufficient conditions for the optimality of the law of likelihood. According to the law of likelihood, observation O favors H1 over H2 if and only if P(O|H1) > P(O|H2). Taper and Lele also mention that information criteria, or at least order-consistent information criteria, are evidence functions as well. By “order-consistent information criteria” they mean criteria such that, if the true model is in the model set, a sufficient amount of data will lead them to find the true model. Having established the relationship of both the likelihood and information-criteria paradigms to evidentialism, Taper and Lele explore another pillar of frequentist statistics: error statistics. They distinguish between two concepts: global reliability and local reliability. They seem sympathetic to some sort of reliabilist account of justification which, roughly speaking, states that a strength of evidence is justified if and only if it has been produced by a reliable evidence-producing mechanism. Given their opinion that scientific epistemology is, or at least should be, a public and not a private epistemology, Taper and Lele are no friends of Bayesians, because they think Bayesian
subjectivity infects the objective enterprise of scientific knowledge accumulation.8 They contend that “evidentialism” of their variety will track the truth in the long run.9 According to them, global reliability describes the truth-tracking or error-avoidance behavior of an inference procedure over all of its potential applications in the long run. They argue that global reliability measures commonly used in statistics are provided by Neyman-Pearson test sizes (α and β) and confidence-interval levels. These measures, according to them, describe the reliability of the test procedures (and not of individual inferences): a global reliability that is a property of a long-run relative frequency of repeatable events. Royall’s probability of misleading evidence is a similar long-run measure of the reliability of evidence-generating procedures. Taper and Lele distinguish this global reliability measure from its local version, which concerns arriving at the truth in a specific scenario. They think that Fisher’s p-values and Mayo and Spanos’s concept of severity provide local reliability measures. The p-value is the probability of obtaining a value of the test statistic as extreme as, or more extreme than, the one actually observed. According to Mayo and Spanos, a hypothesis passes with severity with respect to a specific observed outcome relative to a specific test (see section 2.2 for Mayo and Spanos’s view on severity). Taper and Lele define their notion of the local reliability of the evidence, which they call “ML”, as the probability of obtaining misleading evidence for one model over the alternative at least as strong as the observed evidence. The smaller ML is, the greater one’s confidence that one’s evidence is not misleading. What is significant in their paper is that their concepts of evidence and reliability are distinct, and that both may be used in making inferences. This differs from an error-statistical analysis such as that given by Mayo and Spanos. According to them, ML is clearly different from their notion of global reliability. It is also different from the third key concept (discussed in Blume’s chapter), which concerns the probability that the observed evidence is misleading. To compute the probability that the observed evidence is misleading, one needs to fall back on a posterior probability value; as a result, the measure associated with the third concept is open to the charge of subjectivity. However, ML does not depend on any posterior probability calculation. In the remainder of the paper, they connect philosophers’ work on “reliability” with their evidentialism, although they think that the latter is an evolving and thriving research program. As a result, more research needs to be carried out before we have a fully developed account of evidentialism of their variety.

8 By “knowledge,” they must mean something different from what would be called “knowledge” in most of the epistemological literature in philosophy. They reject the truth of any comprehensible proposition of scientific interest and do not believe that belief is a necessary component of any scientific enterprise concerning “knowledge.”

9 Any reader familiar with the epistemological literature in philosophy might wonder about the apparent problem of reconciling “reliability” with “evidentialism,” since evidentialists have raised objections to any account of justification that rests on reliabilism.
However, Taper and Lele’s senses of “reliabilism” and “evidentialism” may not map exactly onto the philosophers’ usage of these terms. Hence, there is less reason to worry about an inconsistency in their likelihood framework.
2.5 The Akaikean information-criterion-based statistics paradigm
The last paradigm to be considered is the Akaikean paradigm. In philosophy of science, the Akaikean paradigm has emerged as a prominent research program due primarily to the work of Malcolm Forster and Elliott Sober on the curve-fitting problem [Forster and Sober, 1994]. An investigator comes across “noisy data” from which she would like to make a reliable inference about the underlying mechanism that has generated the data. The problem is to draw inferences about the “signal” behind the noise. Suppose that the signal is described in terms of some mathematical formula. If the formula has too many adjustable parameters, then it will begin to fit mainly to the noise. On the other hand, if the formula has too few adjustable parameters, then the formula provides a small family of curves, thereby reducing the flexibility needed to capture the signal itself. The problem in curve-fitting is to find a principled way of navigating between these undesirable extremes. In other words, how should one trade off the conflicting considerations of simplicity and goodness-of-fit? This is a problem that applies to any situation in which sets of equations with adjustable parameters, commonly called models, are fitted to data and subsequently used for prediction. In general, with regard to the curve-fitting problem, the goal of Forster and Sober is to measure the degree to which a model is able to capture the signal behind the noise — or, equivalently, to maximize the accuracy of predictions, since only the signal can be used for prediction, given that noise is unpredictable by its very nature. The Akaikean Information Criterion (AIC) is one possible way of achieving this goal. AIC assumes that a true probability distribution exists that generates independent data points. Even though the true probability distribution is unknown, AIC provides an approximately unbiased estimate of the predictive accuracy of a family of curves fitted to a data set of a given size, under surprisingly weak assumptions about the nature of the true distribution. Forster and Sober want to explain how we can predict future data from past data. An agent uses the available data to obtain the maximum likelihood estimates (MLE) of the parameters of the model under consideration, which yields a single “best-fitting curve” that is capable of making predictions. The question is how well this curve will perform in predicting unseen data. A model might fare well on one occasion, but fail to do so on another. The predictive accuracy of a model depends on how well it would do on average, were these processes repeated again and again. AIC can be considered an approximately unbiased estimator for the predictive accuracy of a model. By an “unbiased estimator,” we mean that the estimator will equal the population parameter on average, with the average being relative to repeated random sampling of data of the same size as the actual data. Using AIC, the investigator attempts to select the model with minimum expected Kullback-Leibler (KL) distance across these repeated random samples, based on a single observed random sample. For each model, the AIC score is

AIC = −2 log(L̂) + 2k    (E2)
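As an illustration of (E2), here is a minimal sketch (ours, not Forster and Sober’s) of computing AIC scores for polynomial models of increasing complexity; the Gaussian noise model and the particular data set are assumptions made purely for the example.

```python
# Sketch: AIC scores (E2) for polynomial models fitted by maximum
# likelihood, assuming independent Gaussian noise around a linear signal.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)   # the "signal" is linear

def aic_for_polynomial(degree):
    coeffs = np.polyfit(x, y, degree)              # MLE of the curve parameters
    residuals = y - np.polyval(coeffs, x)
    sigma2 = np.mean(residuals ** 2)               # MLE of the noise variance
    n = x.size
    # maximized Gaussian log-likelihood at the MLE
    log_l = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                                 # curve coefficients + variance
    return -2 * log_l + 2 * k

for d in (1, 2, 3, 6):
    print(d, round(aic_for_polynomial(d), 1))      # lowest AIC is preferred
```

Higher-degree polynomials always fit the sample better, but the 2k penalty typically leaves the simple linear model with the lowest AIC score, which is the trade-off described above.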
Here k represents the number of adjustable parameters. The maximum likelihood, represented by L̂, is simply the probability of the actual data given by the model fitted to those same data. The fact that the maximum-likelihood term uses the same data twice (once to fit the model and once to determine the likelihood of the fitted model) introduces a bias; the same curve will not fit unseen data quite as well. The penalty for complexity is introduced in order to correct for that bias. Another goal of AIC is to minimize the average (expected) distance between the true distribution and the estimated distribution, which is equivalent to the goal of maximizing predictive accuracy. AIC was designed to minimize the KL distance between a fitted MLE model and the distribution actually generating the data. Specifically, AIC can be considered an approximately unbiased estimator of the expected KL divergence when an MLE is used to estimate the parameters of the model. With this background, it is worthwhile to consider Forster and Sober’s contribution to the volume, where they analyze the Akaikean framework from a different perspective. AIC is a frequentist construct in the sense that it provides a criterion, or rule of inference, that is evaluated according to the characteristics of its long-run performance in repeated instances. Bayesians find the use of frequency-based criteria problematic; since AIC is a frequentist construct, Bayesians worry about its foundation. What Forster and Sober have done in their chapter is to show that Bayesians can regard AIC scores as providing evidence for hypotheses about the predictive accuracies of models. A key point in their paper is that differences of evidential strength between hypotheses about predictive accuracy can be interpreted in terms of the law of likelihood (LL), which is something that Bayesians can accept. According to the LL, observation O favors H1 over H2 if and only if P(O|H1) > P(O|H2). The secret is to take O to be the “observation” that AIC has a certain value, and H1 and H2 to be competing “meta”-hypotheses about the predictive accuracy of a model (or about the difference in predictive accuracies between two models). According to Forster and Sober, the Bayesians’ worry about AIC turns out to be untenable, because AIC scores are now evidence for hypotheses about predictive accuracy according to the LL, which is one of the cornerstones of Bayesianism. Revisiting the four types of questions, we find that the Akaike framework is capable of responding to the prediction question. Forster and Sober maintain that, in fact, it is able to do this in terms of the evidence question, because the prediction question can be viewed as a sub-class of the evidence question.
3 THE LIKELIHOOD PRINCIPLE
So far, we have considered four paradigms in statistics along with their varying and often conflicting approaches to statistical inference and evidence. However, we also want to discuss some of the most significant principles whose acceptance or rejection pinpoints the central disagreements among the four schools. The likelihood
principle (LP) is definitely one of the most important. In his chapter, Jason Grossman analyzes the nature of, and the controversies surrounding, the LP. Bayesians are fond of this principle. The LP says that, in judging whether a theory is supported by data, the only relevant feature of the data is the likelihood of the actual data given the theory. The LP is derivable from two principles: the first is the weak sufficiency principle (WSP) and the second is the weak conditionality principle (WCP). The WSP claims that if T is a sufficient statistic and T(x1) = T(x2), then x1 and x2 provide equal evidential support. A statistic is sufficient if, given it, the distribution of any other statistic does not involve the unknown parameter θ. Both Bayesians and likelihoodists like Royall and Blume accept the LP, whereas classical/error statisticians do not. Grossman clarifies the reason that classical/error statisticians question the principle: data that investigators might have obtained, but do not actually have, should, according to error statisticians, play a significant role in evaluating statistical inference. Significance testing, hypothesis testing, and the confidence-interval approach, which are the tool-kits of classical/error statistics, violate the LP, since they incorporate both actual and possible data in their calculations. In this sense, Grossman’s chapter overlaps with many papers in the section on “Four Paradigms,” especially Mayo and Spanos’s paper in this volume. Grossman also discusses the relationship between the law of likelihood (LL) and the LP. They are different concepts; hence, supporting or denying one does not necessarily commit one to supporting or denying the other. The LP only says where the information about the parameter θ is, and suggests that data summarization can be done, in some way, through the likelihood function. It does not indicate how to compare the evidential strength of two competing hypotheses. In contrast, the LL compares the evidential strength of two contending hypotheses.
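A standard illustration of what is at stake — our sketch, not Grossman’s — contrasts a binomial design (fix the number of tosses) with a negative-binomial design (toss until a fixed number of heads). The two designs produce proportional likelihood functions for the same data, so by the LP they carry the same evidence about the coin’s bias, even though their frequentist error probabilities differ.

```python
# Sketch: 9 heads in 12 tosses, under two stopping rules. The likelihood
# functions in p differ only by a constant factor, so the LP treats the
# two experiments as evidentially equivalent.
from math import comb

n, k = 12, 9   # 12 tosses, 9 heads

def binomial_likelihood(p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def neg_binomial_likelihood(p):
    # stop at the 9th head; the 12th toss must be that 9th head
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

for p in (0.25, 0.5, 0.75):
    ratio = binomial_likelihood(p) / neg_binomial_likelihood(p)
    print(p, round(ratio, 4))   # constant in p: comb(12,9)/comb(11,8) = 4/3

# Yet frequentist p-values for testing p = 1/2 differ between the designs,
# because they sum over unobserved possible outcomes -- which is exactly
# why error statisticians reject the LP.
```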
4 THE CURVE-FITTING PROBLEM, PROBLEM OF INDUCTION, AND ROLE OF SIMPLICITY IN INFERENCE
The previous discussion of four statistical paradigms and how they address four types of questions provides a framework for evaluating several approaches to inductive inference. Two related problems that often confront any approach to inductive inference are the pattern-recognition problem and the “curve-fitting problem.” In both cases one is confronted with two conflicting desiderata, “simplicity” and “goodness-of-fit.” Numerous accounts, within statistics and outside it, have been proposed for how to understand these desiderata and how to reconcile them in an optimal way. Statistical learning theory (SLT) is one such approach, motivated by a certain set of themes. Learning from a pattern is a fundamental aspect of inductive inference. It becomes all the more significant if a theory is able to capture our learning via pattern recognition in day-to-day life, as this type of learning is not easily suited to systematic computer programming.
Suppose for example that we want to develop a system for recognizing whether a given visual image is an image of a cat. We would like to come up with a function from a specification of an image to a verdict — a function that maximizes the probability of a correct verdict. To achieve this goal, the system is given several examples of cases in which an “expert” has classified an image as of a cat or not of a cat. We have noted before that no assumption-free inference is possible, in the sense that we have to assume that the data at hand provide some clues about future data. Note that this assumption is a version of the uniformity-of-nature assumption. In order for the investigator to generate examples, she assumes that there is an unknown probability distribution characterizing when particular images will be encountered and relating images to their correct classification. We assume that the new cases we will come across are also randomly sampled from that probability distribution. This is similar to what is assumed in the Akaikean framework, discussed earlier. We assume that the probability of the occurrence of an item with a certain characterization and classification is independent of the occurrence of other items, and that the same probability distribution governs the occurrence of each item. The point of the probabilistic-independence assumption is that each new observation provides maximum information; the identical-distribution assumption implies that each observation gives exactly the same information about the underlying probability distribution as any other. (These assumptions can be relaxed in various ways.) One central notion in SLT is the notion of VC-dimension, which is defined in terms of shattering. A set of hypotheses S shatters certain data if and only if S is compatible with every way of classifying the data. That is, S shatters a given set of feature vectors if, for every labeling of the feature vectors (e.g., as “cat” or “not cat”), some hypothesis in S generates that labeling. The VC-dimension of a set of rules C is the largest finite number N for which some set of N points is shattered by rules in C; if there is no such largest N, the VC-dimension is infinite. The VC-dimension of C provides a measure of the “complexity” of C. Various learning methods aim to choose a hypothesis in such a way as to minimize the expected error of prediction about the next batch of observations. In developing their account of inductive inference, Gilbert Harman and Sanjeev Kulkarni, in their joint paper, contend that SLT has a lot to offer philosophers by way of better understanding the problem of induction and finding a reliable method to arrive at the truth. They note similarities between Popper’s notion of falsifiability and VC-dimension, and distinguish low VC-dimension from simplicity in Popper’s or any ordinary sense. Harman and Kulkarni address those similarities between the two views and argue that SLT could improve Popper’s account of simplicity. Daniel Steel, while bringing in Popper’s notion of falsifiability, has gone further, arguing that the aim of Popper’s account differs from the aim of SLT. According to Popper, the scientific process of conjectures and refutations generates ever more testable theories that more closely approximate the truth. This staunch realism toward scientific theories, according to Steel, is absent in SLT, which aims at minimizing the expected
error of prediction. Even though there is an apparent difference between these two approaches, Steel conjectures that there might be some underlying connection between predictive accuracy and efficient convergence to the truth.

Like SLT, both the Minimum Description Length (MDL) principle and the earlier Minimum Message Length (MML) principle aim to balance model complexity against goodness-of-fit in order to make reliable inferences from data.10 In both approaches, the optimal trade-off is considered to be the one that allows for the best compression of the data, in the sense that the same information is described in terms of a shorter representation. In MML, it is important that the compression be in two parts: the hypothesis (H1) followed by the data given the hypothesis (D given H1). According to the MDL principle, the more one can compress a given set of data, the more one has learned about the data. MDL inference requires that all hypotheses be specified in terms of codes. A code is a function that maps all possible outcomes to binary sequences, so that the length of the encoded representation can be expressed in terms of bits. The MDL principle provides a recipe for selecting a hypothesis: choose the hypothesis H for which the sum of the length of the hypothesis, L(H), and the length of the description of the data using the hypothesis, LH(D), is shortest. The heart of the matter is, of course, how these codes L(H) and LH(D) should be defined. In their introduction to MDL learning, Steven de Rooij and Peter Grünwald explain why these codes should be defined to minimize the worst-case regret (roughly, the coding overhead compared to the best of the considered codes), while at the same time achieving especially short code-lengths if one is lucky, in the sense that the data turn out to be easy to compress. By minimizing regret in the worst case over all possible data sequences, one need not make any assumptions about what the data will be like: if one performs reasonably well for the worst possible data set, then one performs reasonably well for any data set. However, in general it may not even be possible to learn from data. Rather than assuming the truth to be simple, De Rooij and Grünwald introduce the alternative concept of luckiness: codes are designed in such a way that if the data turn out to be simple, we are lucky and we will learn especially well. The Minimum Message Length (MML) principle is similar to the MDL principle in that it, too, proposes a resolution of what we have called the curve-fitting problem. Like MDL, data compression plays a key role in MML.

10 There are considerable disagreements among experts on SLT, MML, and MDL regarding which is the more general theory, in the sense of whether, for example, SLT is able to include discussion of the properties of the rest. SLT adherents argue that SLT is the only general theory, whereas MML adherents contend that MML is the most general approach. One MDL author, however, thinks it misleading to take any of the three approaches as more fundamental than the rest. We are thankful to Gilbert Harman, David Dowe, and especially Peter Grünwald for their comments on this debate. For a discussion of these entangled issues, see [Grünwald, 2007, chapter 17].
We, as editors, however, would like to report those disagreements among these experts without taking sides in the debate.
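The two-part-code idea elaborated in what follows can be made concrete with a small sketch (ours, not from any of these chapters); the fixed six-bit code for the parameter is an arbitrary simplifying assumption.

```python
# Crude sketch of the two-part code behind MDL/MML: total description
# length = L(H), the bits to state the hypothesis, plus
# L_H(D) = -log2 P(D|H), the (Shannon) bits to encode the data given H.
from math import log2

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]   # toy binary sequence

def two_part_length(p, precision_bits=6):
    """Code length for the hypothesis 'Bernoulli(p)' followed by the data."""
    l_hypothesis = precision_bits             # bits to state p to fixed precision
    l_data = sum(-log2(p if x == 1 else 1 - p) for x in data)
    return l_hypothesis + l_data

# The two-part principle picks the hypothesis minimizing the total length:
candidates = [i / 64 for i in range(1, 64)]
best = min(candidates, key=two_part_length)
print(best, round(two_part_length(best), 2))  # near the sample frequency 9/12
```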
The more one can compress the data, the more information one is able to get from the data; moreover, the shorter the code for presenting that information, the better it is in terms of MML. One way to motivate either the MDL or the MML approach is to think of the length of codes in terms of Kolmogorov complexity, on which the complexity of a data string is measured by the shortest input that makes a Turing machine generate that string (for Kolmogorov complexity and its relation to random sequences, see section 7.3). This approach uses two-part codes. The first part always represents the information one is trying to learn; that is, it encodes the hypothesis H and prepares the Turing machine to read and generate data, on the assumption that the data were generated by the hypothesis H encoded in the first part. In the first part of the message, the codes do not cause the Turing machine to write. The second part of the message encodes the data assuming the hypothesis (or model) given in the first part, and then makes the Turing machine write the data. In the use of two-part codes there is very little difference between MML and MDL. However, there is a fundamental difference between the two: MML is a subjective Bayesian approach in its interpretation of the codes used, whereas MDL eschews any subjectivism in favor of the concept of luckiness. MML can exploit an agent’s prior (degrees of) belief about the data-generating process, but it can also attempt to make the priors as objective as possible by using a simplest universal Turing machine. In his paper on the Bayesian information-theoretic MML principle, David Dowe surveys a variety of statistical and philosophical applications of MML, including relating MML to hybrid Bayesian nets. The idea underlying MML comes from information theory, where information is taken to be the negative logarithm of probability. This view has also led him to two recent results: (i) the only scoring system which remains invariant under re-framing of questions is the logarithm-of-probability score, and (ii) a related new uniqueness result about the Kullback-Leibler divergence between probability distributions. Dowe re-states his conjecture that, for problems where the amount of data per parameter is bounded above (e.g., the Neyman-Scott problem, latent factor analysis, etc.), to guarantee both statistical invariance and statistical consistency in general, it appears that one needs either MML or a closely related Bayesian approach. Using the statistical consistency of MML and its relation to Kolmogorov complexity, Dowe independently re-discovers Scriven’s human unpredictability as the “elusive model paradox” and then resolves the paradox (independently of Lewis and Shelby Richardson [1966]) using the undecidability of the halting problem. He also provides an outline of the differences between MML and the various versions of the later MDL principle that have appeared over the years (for references for the papers discussed in this paragraph, see Dowe’s chapter in this volume).

The notion of simplicity in statistical inference is a recurring theme in several papers in this volume. De Rooij and Grünwald have addressed the role of simplicity, which they call “the principle of parsimony,” in learning. Those who think that simplicity has an epistemological role to play in statistical inference contend that simpler theories are more likely to be true. There could be two opposing camps in
statistical inference regarding the epistemological role of simplicity: one camp Bayesian, the other non-Bayesian [Forster and Sober, 1994]. De Rooij and Grünwald have, however, identified the epistemological interpretation of simplicity with Bayesians. A natural extension of the epistemological construal of simplicity in statistical inference is to believe that simpler theories are more likely to be true. Subjective Bayesians subscribe to this epistemological account of simplicity: the hypothesis with maximum posterior probability is believed most likely to be true. De Rooij and Grünwald distance themselves from this interpretation, because the philosophy behind MDL aims to find useful hypotheses without making any assertions as to their truth. De Rooij and Grünwald assert that one fundamental difference between MDL, on the one hand, and MML and SLT, on the other, is that the former does not seem to have built into its philosophy any form of the uniformity-of-nature assumption that we find in the latter two. They would hesitate to assume that the data at hand necessarily provide clues about future data; they prefer not to discount the possibility that they may not. Instead, they design methods so that we learn from the data if we are in the lucky scenario where such learning is possible. According to them, this is a key distinction between the MDL approach and other approaches, including MML and SLT.

In his paper, Kevin Kelly agrees with non-Bayesians like De Rooij and Grünwald in rejecting the Bayesian explanation of the role of simplicity in scientific theory choice. The standard Bayesian argument for simplicity, as already stated, is that simpler theories are more likely to be true. Bayesians use some form of Bayes’ theorem to defend their stance toward the role of simplicity in theory choice. This could take the form of comparing the posterior probabilities of two competing theories in terms of the posterior ratio:

P(S|D)/P(C|D) = [P(S)/P(C)] × [P(D|S)/P(D|C)],    (E3)
where theory S is simple (in the sense of having no free parameters) and theory C is more complex (in the sense of having a free parameter θ that ranges, say, over k discrete values). The first quotient on the right-hand side of (E3) is the ratio of the prior probabilities. According to Kelly, setting P(S) > P(C) clearly begs the question in favor of simplicity. So he supposes, out of “fairness,” that P(S) is roughly equal to P(C), so that the comparison depends on the second quotient on the right-hand side of (E3), which is called the Bayes factor. According to him, the Bayes factor appears “objective,” but when expanded out by the rule of total probability, it assumes the form:

P(S|D)/P(C|D) = [P(S)/P(C)] × P(D|S) / Σθ P(D|Cθ)P(Cθ|C),

which involves the subjective prior probabilities P(Cθ|C). Kelly’s point is that typically there is some value of θ such that P(D|S) = P(D|Cθ). If P(Cθ|C) = 1 for that value,
then the posterior ratio evaluates to 1 (the complex theory is as credible as the simple theory). But in that case the parameter θ is not “free,” since one has strong a priori views about how it would be set if C were true. To say that θ is “free” is to adopt a more or less uniform distribution over the k values of θ. In that case, the posterior ratio evaluates to k — a strong advantage for the simple theory that becomes arbitrarily large as the number of possible values of θ goes to infinity. But, objectively speaking, Cθ predicts D just as accurately as S does. The only reason C is “disconfirmed” compared to S in light of D is that the subjective prior probability P(Cθ|C) = 1/k gets passed through Bayes’ theorem. Kelly concludes, therefore, that the Bayesian argument for simplicity based on the Bayes factor is still circular, since it amounts to a prior bias in favor of the simple world S in comparison to each of the complex possibilities Cθ. Kelly proposes a new, alternative explanation of Ockham’s razor that is supposed to connect simplicity with truth in a non-circular way. The explanation is based on the Ockham Efficiency Theorem, according to which Ockham’s razor is the unique strategy that keeps one on the most direct path to the truth, where directness is measured in terms of jointly minimizing course reversals en route to the truth and the times at which these course reversals occur. Since no prior probabilities are involved in the theorem, Kelly maintains that it does not beg the question the way simplicity-biased prior probabilities do. Furthermore, since Kelly views directness of pursuit of the truth as a concept of truth-conduciveness, he views the Ockham Efficiency Theorem as a foundation for scientific inference rather than for instrumentalistic model selection. In that respect, he parts company with the antirealism of De Rooij and Grünwald, who maintain, in light of the MDL approach, that simplicity plays primarily a heuristic role in making a theory useful. Kelly argues that in the case of causal theory choice from non-experimental data (see section 6 below), getting the causal arrows backwards yields extremely inaccurate policy predictions, so the Ockham Efficiency Theorem, according to Kelly, is the only available non-circular foundational viewpoint for causal discovery from non-experimental data.
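The arithmetic behind Kelly’s complaint can be sketched in a few lines (our illustration, with made-up numbers): a “fair” uniform prior over the k settings of the free parameter still yields a Bayes factor of k in favor of the simple theory.

```python
# Sketch: S predicts D with the same accuracy as one setting C_theta* of
# the complex theory's free parameter; a uniform prior over the k values
# still hands S a posterior advantage of exactly k.
k = 10                                       # possible values of the parameter
p_d_given_s = 0.9                            # P(D|S)
p_d_given_c_theta = [0.9] + [0.0] * (k - 1)  # only theta* matches S's prediction

# P(D|C) = sum_theta P(D|C_theta) * P(C_theta|C), with P(C_theta|C) = 1/k
p_d_given_c = sum(p * (1.0 / k) for p in p_d_given_c_theta)

# With P(S) = P(C), the posterior ratio (E3) reduces to the Bayes factor:
print(p_d_given_s / p_d_given_c)             # = k = 10, favoring the simple theory
```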
5 RECENT ADVANCES IN MODEL SELECTION
We touched on model selection problems when we discussed the curve-fitting problem, especially with regard to the AIC paradigm in statistics. Model selection is such a hotly debated topic in the statistical literature that it deserves a special place in our volume. Still, model selection is a narrowly focused area of statistics, in which investigators are confronted with choosing the model that provides the best possible account of the underlying mechanism generating the data, and with the role simplicity contributes to that choice. It is a delicate question whether there is a characteristic difference between theory choice and model selection. In revolutions in physical theories, Einstein’s theory replaced the Newtonian theory; and a new discipline like biochemistry emerged when the double-helical structure of DNA was discovered. Model selection, however, cannot or should not
decide between two contending theories, as we see in the case of physical theories. In contrast, model selection addresses a narrow topic with specific problems, like whether the data in question come from a normal distribution or from some non-normal distribution. Setting the model selection issues in this context of theory choice in advanced physical theories, Arijit Chakrabarti and Jayanta Ghosh address two widely discussed model selection criteria: AIC and the Bayesian Information Criterion (BIC). BIC approximates Prob(Hk | data) as proportional to the maximized likelihood of the sample multiplied by n^(−k/2), where n is the sample size and k is the number of parameters of model Hk. Surveying the literature in statistics and computer science, where model selection criteria are extensively applied, Chakrabarti and Ghosh distinguish between the purposes for which AIC and BIC were introduced. They argue that BIC is more useful when the true model is included within the model space. In contrast, AIC is more effective in predicting future observations. This depiction of the difference between BIC and AIC sets the tone for a peaceful co-existence of both model selection criteria; too often, statisticians and philosophers are involved in a “statistics war,” pleading for the superiority of one criterion over the other. This also provides an elegant connection between the theme echoed in Chakrabarti and Ghosh’s paper, on the one hand, and Norton’s and Kelly’s papers, on the other. The common theme shared by all four authors is that no account of inductive inference should defend a “one-size-fits-all” philosophy. Although both Norton and Kelly are non-Bayesians and their targets of criticism are Bayesians, Chakrabarti and Ghosh are themselves Bayesians, and their targets could be both Bayesians and non-Bayesians.

A. P. Dawid in his chapter has also focused on Bayesian model selection issues. Within a Bayesian framework, the conventional wisdom about the problem of model choice is that model choice is more sensitive to the prior than is standard parametric inference. Working within a Bayesian framework, he nevertheless rejects this conventional wisdom, based on a study of the asymptotic dependence of posterior model probabilities on the prior specifications and the data. Moreover, the paper suggests a possible solution to the problem that Bayes factors pose with respect to improper priors, by specifying a single overall prior and then focusing on the posterior model distributions. The general theory is illustrated by constructing reference posterior probabilities both for normal regression models and for the analysis of an ESP experiment. In a sense, however, Dawid’s interest is much broader than model selection alone: he is interested in Bayesian inference generally. Consequently, he has developed a general Bayesian theory of inference with a specific type of posterior probability construction.
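As a small sketch (ours, not Chakrabarti and Ghosh’s) of how the two criteria can disagree, the made-up maximized log-likelihoods below let AIC and BIC pick different models once n is large; BIC’s log(n) penalty grows with the sample size, so it favors smaller models than AIC.

```python
# Sketch: AIC and BIC applied to the same maximized log-likelihoods.
import math

def aic(log_l, k):
    return -2 * log_l + 2 * k

def bic(log_l, k, n):
    return -2 * log_l + k * math.log(n)

n = 1000
# hypothetical maximized log-likelihoods for nested models with k parameters
models = {2: -520.0, 5: -515.0, 12: -511.0}

for k, log_l in models.items():
    print(k, round(aic(log_l, k), 1), round(bic(log_l, k, n), 1))
# AIC scores: 1044.0, 1040.0, 1046.0  -> AIC picks k = 5
# BIC scores: 1053.8, 1064.5, 1104.9  -> BIC picks k = 2
```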
6 CAUSAL INFERENCE IN OBSERVATIONAL STUDIES
Whether we engage in statistical inference, assess statistical evidence, or tackle a model selection problem, data are enormously important for statistics and for making reliable inferences. Often, data or observations can be randomized. Frequently, however, we
do not have the luxury of randomly sampling our observations so that our inferences would be reliable in the sense of minimizing biases or systematic errors. Such cases arise in observational studies, where we have no randomization option but still need to make inferences. Sometimes we also make causal inferences based on observational studies. Peter Spirtes’s and Sander Greenland’s two papers address how one can successfully make reliable causal inferences when the topics in question involve observational studies.

In building a causal model from observational studies, Spirtes distinguishes between two kinds of questions: (i) qualitative and (ii) quantitative. A qualitative question asks, “Will manipulating a barometer reading affect subsequent rainfall?” A quantitative question asks, “How much does manipulating a barometer reading affect subsequent rainfall?” With observational studies, one might be confronted with several issues about how to make a reliable qualitative or quantitative inference. Spirtes is interested in making reliable causal inferences in observational studies when two variables may be connected by an unmeasured common cause. For example, the correlation between the two observed variables barometer reading and rainfall could be due to an unmeasured (latent) common cause such as atmospheric pressure; without further background knowledge, the correlation could also be due to the barometer reading causing rainfall. Without some substantive causal assumptions about this entire causal setup, one cannot hope to make any reliable inference about which of these causal relationships between barometer reading and rainfall is correct simply from observing the correlation between them. Sometimes attempts are made to make reliable inferences about the causal relationship between barometer reading and rainfall by using temporal information and conditioning on as many potential causes as possible prior to the occurrence of the putative cause. However, in this scenario there is still no way to make reliable causal inferences without substantive causal assumptions about whether all the relevant variables have been conditioned on; in addition, standard causal inference methods can be misled by conditioning on too many variables as well as on too few. Spirtes discusses some general assumptions relating causal relationships to statistical relationships, the circumstances in which causal inferences cannot be reliably made under these assumptions when unmeasured common causes may be present, and alternative methodologies for reliable causal inference where it is possible.

Like Spirtes, Greenland is also interested in causal inference in observational studies. However, he is focused on a manipulative account of causation, in which an investigator wants to observe a particular outcome in a subject after she is given a treatment. This model is called the potential-outcome, or counterfactual, model. In the potential-outcome model, the investigator approaches the issue by defining the effects of a cause or intervention and then defining the effects of treatment. Suppose we would like to observe the effects of different doses of AZT on patients infected with HIV. We would compare the one-year mortality rates that would occur in a given population after they are administered certain doses of AZT, or none at all. Here, the range of interventions in terms of a patient’s doses of AZT could very well correspond to different levels of physical discomfort, in
which the potential outcomes could be their lengths of survival, e.g., whether or not the patients survive another year; or their survival times might range from one year to five years from the day the drug was introduced to them. In deterministic situations, like the case of the AZT doses and the survival rates of the patients, the purpose of observational studies, according to Greenland, is to find a causal link connecting treatments of the patients to their recovery or survival rates. In a deterministic world, we use probability to capture uncertainty, although ambiguity remains if the intervention is not completely specified. The situation becomes more complex in a stochastic world, such as the quantum world, where it is not even theoretically possible to know the future fully. The problem can be dealt with by allowing the potential outcomes to refer to distributions (or parameters) instead of realizations [Greenland et al., 1999]. In the present chapter, however, Greenland confines himself to the deterministic world. In articulating the potential-outcome model for such a deterministic world, he discusses the role of causal laws (expressed in terms of structural equations), in which one shifts from an understanding of the counterfactual (had you administered treatment x instead of y to the patient, what would have happened to her?) to an understanding of the law connecting the outcome variable to the antecedent variable. The law in question could be: “if one drops both a feather and a coin at the same time from a height to the ground where there is no air resistance, then they will reach the ground simultaneously.” Greenland thinks that if we find that the predictions of a theory have turned out to be false more often than not, then this provides reason to question its tenability. As students of philosophy of science we know, however, that the falsity of predictions need not lead to the rejection of the theory, since its various auxiliary assumptions could be called into question rather than the theory itself. Similar tangled issues crop up in studying causal inference in observational studies. To circumvent this possible objection — that the falsity of predictions should not necessarily lead to the wholesale rejection of a theory — Greenland demonstrates the benefits of using causal diagrams in these and other relevant situations. A “causal system,” as he calls it, consists of the output w of a function f(u, v), which might become an input to a later function g(u, w) with output x. The arrows in the diagram representing this system are to be taken as causal arrows connecting input variables to output variables. The advantage of this kind of diagram lies in its ability to support “local surgeries” if and when required. The diagrams are devised in such a way that they can isolate the effect of the rejection mentioned above within the causal network of relations among variables, and thus can pinpoint which specific part of the theory or its auxiliaries has to be called into question when the theory’s predictions do not turn out as expected. While developing the potential-outcome models along with their structural-equation generalizations, he makes the reader aware of possible theoretical problems with this model. However, he concludes on a pragmatic note. What matters most, according to him, is the practical significance of these causal diagrams applied to
varied situations, although many foundational questions about the model are yet to be satisfactorily resolved.
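A minimal sketch (ours, not Greenland’s) of the deterministic potential-outcome model may help fix ideas; the tiny “population” below is invented purely for illustration.

```python
# Sketch: each patient has two potential survival outcomes, y1 under
# treatment and y0 without it; the individual causal effect is y1 - y0,
# and only one of the two outcomes is ever observed for any patient
# (the other remains counterfactual).
patients = [
    {"y1": 1, "y0": 0},   # survives only if treated
    {"y1": 1, "y0": 1},   # survives either way
    {"y1": 0, "y0": 0},   # dies either way
    {"y1": 1, "y0": 0},
]

avg_effect = sum(p["y1"] - p["y0"] for p in patients) / len(patients)
print(avg_effect)   # 0.5: treatment causes survival in half of this population

# In an observational study we see y1 only for the treated and y0 only for
# the rest; without substantive assumptions (e.g., no unmeasured common
# causes of treatment and outcome), the observed difference in survival
# rates need not equal avg_effect.
```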
7 SPECIFIC TOPICS OF INTEREST
We began this “Introduction” with the types of questions and problems with which philosophers are often confronted. We have seen that they are more often interested in general epistemic and causal issues, including issues about the foundations of statistical inference. Philosophers trained primarily in the analytic tradition are sometimes interested in simple, specific problems. Nonetheless, it would be a mistake to think that only philosophers are interested in specific problems. This section examines how statisticians, mathematicians, and computer scientists, along with philosophers, are busy working out the details of some specific problems in the topics of their interest.
7.1 Conditional probability
One specific problem in which various philosophers have lately been getting interested is: what is conditional probability? The conditional probability P(H|D) is traditionally defined as P(H & D)/P(D) when P(D) is greater than zero. Here, as before, “H” stands for a hypothesis and “D” stands for data. Alan Hájek in his contribution suggests that although this way of understanding conditional probability is common and was proposed by Kolmogorov, it is actually offered as a conceptual analysis of conditional probability. According to Hájek, one is free to stipulate that a particular technical expression is a shorthand for a particular concept, but one is not free to assign that concept any meaning one chooses. Conditional probability is not simply an abbreviation; rather, it is answerable to certain pre-theoretical intuitions. Thus, although we could choose to make “P(H|D)” a shorthand for this ratio, he argues, we do not have any good reason to identify the expression “the probability of H given D” with this ratio. He evaluates several arguments for the ratio analysis, and ultimately rejects them. Instead, he argues, conditional probability should be taken as primitive. One of the arguments he offers for dispensing with the ratio analysis is illustrated by an example. What is the probability that a coin turns up heads, given that I toss it fairly? It should be 1/2. The problem, Hájek contends, is that according to the ratio analysis, the conditional probability is the ratio P(the coin lands heads & I toss the coin fairly)/P(I toss the coin fairly), and both unconditional probabilities may fail to be defined. After all, he argues, “you may simply not assign them values. After some thought, you may start to assign them values, but the damage has already been done; and then again, you may still not do so.” The damage that has been done is that there is a time at which an agent assigns a conditional probability (1/2) in the absence of the corresponding unconditional probabilities required by the ratio formula. As is evident, this is an immediate counterexample to the ratio formula. It does not save the
formula from this counterexample if later on the agent in question happens to assign the requisite unconditional probabilities; the counterexample concerned the earlier time, and its existence provides a refutation of the ratio analysis of conditional probability. In this example, Hájek thinks, there is a clear-cut conception of conditional probability according to which the answer is 1/2. He explores the fundamental nature of conditional probability in the various interpretations of probability. He canvasses other arguments against the ratio analysis: cases in which the condition (D) has probability 0 (and thus can't appear in the denominator of the ratio), or unsharp probability, or vague probability. (These notions are explained in sections 4.2 and 4.3 of his chapter.) He also shows how conditional probability plays a key role in various paradoxes, including Simpson's paradox (see section 7.2 for an example and exposition). One highlight of Hájek's paper is its investigation of the debate over whether the notion of conditional probability is a primitive or a defined concept. In the usual way that we learn probability theory, conditional probability is defined in terms of two unconditional probabilities. Since he thinks that we have an intuitive understanding of conditional probability, he recommends reversing the order of analysis. Like Popper, Hájek argues that one should take conditional probability, P(·, ·), to be fundamental and unconditional probability to be derivative: the unconditional probability of a is P(a, T), where T is a logical truth.11 Hájek presents Popper's axioms and postulates for primitive conditional probability functions, known as the Popper functions. This mathematical set-up offers a rival approach to Kolmogorov's. Hájek also points out that Popper's approach has no difficulty in handling the objections raised against the ratio analysis. Building on Hájek's paper, Kenny Easwaran has both addressed and evaluated many of the issues raised by Hájek. He agrees with Hájek that the ratio analysis of conditional probability is faulty. However, he disagrees sharply with Hájek regarding the claim that conditional probability rather than unconditional probability should be regarded as primitive. Easwaran thinks that Hájek's argument in this connection needs to be assessed separately for each interpretation of probability rather than taken as one argument that cuts across all interpretations together. He thinks that, especially with regard to the subjective interpretation of probability, a distinction needs to be made between P(H|D) where "D" is some possible event later in time than t, and P(H|D) where "D" encompasses a complete description of the history of the universe up to time t. Easwaran contends that although all non-logical interpretations of probability (for example, the subjective or propensity interpretations) depend on some information in order for probabilities to be assigned in particular cases, the role this information plays could very well vary. Thus, he concludes that Hájek has not provided a good argument for the claim that the notion of conditional probability should be counted as primitive rather than the notion of unconditional probability.

11 Hájek prefers to reserve the notation "P(·|·)" for the ratio analysis and "P(·, ·)" for primitive conditional probability.
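To see the ratio analysis, and the gap Hájek presses on, in mechanical form, here is a small sketch; the finite die example is our own toy illustration, not drawn from either chapter. The ratio is well defined on a finite space with a uniform measure, but it delivers no value at all the moment P(D) = 0.

```python
# A toy sketch of the ratio analysis on a finite, uniform probability space.
from fractions import Fraction

def prob(event, space):
    # Probability of an event (a subset of outcomes) under the uniform measure.
    return Fraction(len(event & space), len(space))

def conditional(h, d, space):
    # Ratio analysis: P(H|D) = P(H & D) / P(D), undefined when P(D) = 0.
    if prob(d, space) == 0:
        raise ZeroDivisionError("the ratio analysis is silent when P(D) = 0")
    return prob(h & d, space) / prob(d, space)

die = {1, 2, 3, 4, 5, 6}
print(conditional({2, 4, 6}, {4, 5, 6}, die))   # P(even | at least 4) = 2/3
# conditional({2}, set(), die)   # would raise: one of the problem cases above
```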
7.2 Probabilistic and statistical paradoxes
Paradoxes, albeit entertaining, challenge our intuitions in a fundamental way. This is no less true of probabilistic and statistical paradoxes. It is often hard to distinguish between probabilistic paradoxes and statistical paradoxes. One could, however, propose that all statistical paradoxes are probabilistic paradoxes, whereas probabilistic paradoxes are not exactly statistical paradoxes. This leaves a more basic question unanswered: "what is a statistical paradox?" A statistical paradox is a paradox that can be understood in terms of notions that need not be probabilistic, for example, "confounding," "intervening," and so on. This still might not settle the dispute about the difference between statistical and probabilistic paradoxes, because probability theorists could contend that notions like "confounding" are subject to probabilistic reduction. It is, however, sufficient for our purpose to appreciate the difficulty of giving a clear-cut definition of either term as we proceed to discuss probability paradoxes. One way to understand the difference between the two is to consider the "Monty Hall problem" (named after a television game show host), which, among many other paradoxes, Susan Vineberg discusses in her chapter. Suppose you are a contestant on a quiz show. The host, Monty Hall, shows you three doors (A, B, and C). Behind one door is an expensive new car and behind the others are goats. You are to choose one door. If you choose the door with the car, you get it as a prize. If you choose a door with a goat, you get nothing. You announce your choice, and Monty Hall opens one of the unchosen doors, showing you a goat, and offers to let you change your choice. Should you change? Three crucial points, usually overlooked in popularized versions of the Monty Hall problem, need to be clearly stated: (i) the car has an equal chance of being behind any of the doors; (ii) if you choose a door without the prize behind it, Monty will open the other door that does not have the prize behind it; and (iii) if you choose the door with the prize behind it, Monty will open one of the two prizeless doors at random. Given this setup, suppose the contestant has chosen door A, and let us calculate the probability of Monty's opening door B (Monty B) given where the prize is. The probability of Monty's opening door B given that the prize is behind door A is P(Monty B|A) = 1/2. (Recall that once you have chosen door A, Monty is not going to open it, so Monty is left with two choices.) The probability of Monty's opening door B given that the prize is behind door B is P(Monty B|B) = 0. (Recall that he is not going to open the door behind which there is the prize.) The probability of Monty's opening door B given that the prize is behind door C is P(Monty B|C) = 1. (The contestant has chosen door A and the prize is behind door C, so Monty is left with no option other than to open door B.)
We would like to know whether you should stay with your original choice (A), conditional on the information that Monty has opened door B; that is, we want P(A|Monty B). According to Bayes' theorem,

P(A|Monty B) = [P(A) × P(Monty B|A)] / [P(A) × P(Monty B|A) + P(B) × P(Monty B|B) + P(C) × P(Monty B|C)].

Consider:

P(A) × P(Monty B|A) = 1/3 × 1/2 = 1/6,
P(B) × P(Monty B|B) = 1/3 × 0 = 0,
P(C) × P(Monty B|C) = 1/3 × 1 = 1/3.

Now P(A|Monty B) = (1/6)/(1/6 + 0 + 1/3) = (1/6) ÷ (1/2) = 1/3. Thus, if you switch, you have a 2/3 chance of winning. Vineberg considers the Monty Hall problem to be a probability paradox. One needs to be careful about her reason for classifying it as such. She takes it to be a paradox of probability because probabilities are appealed to in the reasoning that gives rise to the paradox, not because of its solution, which rests on using Bayes' theorem. The paradox might also be said to be essentially probabilistic because it is resolved by highlighting the correct use of Bayes' theorem, which yields 1/3 given the assumptions noted above.
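The Bayesian calculation above can be cross-checked by brute-force simulation. The following sketch encodes assumptions (i)-(iii), with the contestant always picking door A; it is only an illustration of the arithmetic, not anything from Vineberg's chapter.

```python
# A quick Monte Carlo check of the Monty Hall calculation above.
import random

def win_rate(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.choice("ABC")
        choice = "A"
        # Monty opens a door that is neither the chosen one nor hides the prize.
        opened = random.choice([d for d in "ABC" if d != choice and d != prize])
        if switch:
            choice = next(d for d in "ABC" if d not in (choice, opened))
        wins += (choice == prize)
    return wins / trials

print(win_rate(switch=False))   # about 1/3, matching the Bayesian answer
print(win_rate(switch=True))    # about 2/3, the gain from switching
```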
In contrast, Simpson's paradox, another paradox that Vineberg considers, can be regarded as both a statistical and a probability paradox at the same time. Simpson's paradox involves the reversal of the direction of a comparison, or the cessation of an association, when data from several groups are combined to form a single whole. Suppose you are considering the acceptance rates for males and females in a graduate program that includes two departments. Consider the following example of Simpson's paradox.

Table 2. Simpson's Paradox

        Dept. 1            Dept. 2           Acceptance Rates        Overall
  CV    Accept  Reject     Accept  Reject    Dept. 1   Dept. 2    Acceptance Rate
  F      180      20        100     200        90%       33%           56%
  M      480     120         10      90        80%       10%           70%
Here, "CV" stands for the two categorical variables, "F" for "females" and "M" for "males." "Accept" and "Reject" give the numbers of acceptances and rejections for the two departments, D1 and D2. Here is a formulation of the paradox, in which the association in the subpopulations is reversed in the combined population: although the acceptance rates for females are higher than for males in each department, in the combined population the rates are reversed. Vineberg explains why the paradox can be accounted for within the confines of probability theory. However, as we know, it could also be regarded as a statistical paradox, because notions like "confounding," which many statisticians consider to be statistical notions, could be used to explain it.
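The reversal can be verified directly from the counts in Table 2; the following sketch simply recomputes the departmental and overall acceptance rates.

```python
# Recomputing the rates in Table 2 (counts copied from the table).
counts = {  # (accepted, rejected) by sex and department
    ("F", "D1"): (180, 20), ("F", "D2"): (100, 200),
    ("M", "D1"): (480, 120), ("M", "D2"): (10, 90),
}

def rate(cells):
    accepted = sum(a for a, _ in cells)
    total = sum(a + r for a, r in cells)
    return round(100 * accepted / total)

for sex in ("F", "M"):
    by_dept = [rate([counts[(sex, d)]]) for d in ("D1", "D2")]
    overall = rate([counts[(sex, d)] for d in ("D1", "D2")])
    print(sex, by_dept, overall)
# F [90, 33] 56 and M [80, 10] 70: higher in each department, lower overall.
```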
In the above example, the effect on acceptance (A) of the explanatory variable, sex (S), is hopelessly mixed up (or "confounded") with the effect on A of the other variable, department (D). According to some statisticians, we are interested in the direct effect of sex on acceptance and not in an indirect effect by way of another variable such as department. The effect of S on A is confounded with the effect on A of a third variable, D. Although Vineberg is fully aware of this difficulty of characterizing a paradox completely in terms of probability theory or statistics, she confines herself to the probability paradoxes, investigating possible reasons for regarding them as probability paradoxes. She counts all of the paradoxes she considers as paradoxes of probability because the reasoning (and/or premises) involved in them is probabilistic, and not because of the form of their solution (for a clear understanding of the relationship between probabilistic reasoning and probabilistic/inductive inference, see the concluding section of Bandyopadhyay and Cherry's chapter). It is for this specific reason that, on her account, Simpson's paradox is a paradox of probability. She argues that some of the paradoxes arise from misapplications of the rules of probability theory. Some of the resolutions of the paradoxes (like the Monty Hall problem) are less controversial, whereas others are more controversial and, as a result, hotly debated. She cites Newcomb's problem as one paradox in the latter category, since it admits of several competing solutions requiring radical rethinking about the foundations of Bayesian decision theory (see [Savage, 1972] for classical Bayesian decision theory).12 The rationale she offers for not regarding it as a paradox of probability is that the reasoning leading to this problem requires non-probabilistic, decision-theoretic considerations.

C. Andy Tsao reviews two statistical paradoxes: (i) Lindley's paradox and (ii) the Fieller-Creasy problem. There are already discussions of these paradoxes in the literature, since both arose in the early 1960s. We will confine ourselves to Tsao's discussion of Lindley's paradox. To set the paradox in a broader perspective, consider the formulation of the hypothesis testing problem with two hypotheses, H0 = the null hypothesis and H1 = the alternative hypothesis. There are infinitely many possible tests regarding the tenability of a hypothesis; some are conservative and some are liberal. In this connection, the term "conservative" ("liberal") refers to a specific property of tests that have a smaller (larger) probability of rejecting the null when the alternative is typically a new, or previously uncontested, statement of a hypothesis/theory. Let us assume that H0 = the suspect is innocent vs. H1 = the suspect is guilty. A conservative judge will presume the innocence of the suspect even at the cost of letting some of the guilty walk free, while controlling the rate of error associated with judging the innocent to be guilty. "Is this statistical procedure justified?" is one basic question

12 In the last ten to fifteen years, a great deal of research has been done on Bayesian decision theory in several fields. However, there are few books on Bayesian decision theory better than Berger's [1985] classic, written from a statistician's point of view.
There are also a couple of recent good books on the same issues in philosophy. Joyce [1999] addresses Bayesian decision theory from a causal decision-theoretic perspective. See also Weirich [2004] for a decision theory for realistic, non-ideal agents.
of statistical decision theory. Depending on one criterion, one could consider this procedure problematic, whereas under another criterion one could deem it justified. However, we also have an intuitive sense of what counts as "justified," and statisticians often fall back on it to support their stance toward evaluating a theory. Lindley's paradox demonstrates that the common belief that classical statistics is conservative is mistaken. Consider why conservatism has usually been associated with classical statistics. The confidence coefficient on which the idea of a confidence interval rests is the smallest coverage probability among all possible θ (rather than their average); likewise, the size of a test is the largest probability of Type I error over all possible θ in the null parameter space. Classical statistics is thus taken to be conservative in rejecting the null: it is harder to reject the null if one's tests are conservative. Consider two values a(x) and b(x), functions of the observed data, such that for any x we have a(x) < b(x). In this hypothesis testing setup, we will reject H0 if a(x) < α (similarly, if b(x) < α). Under this scenario, it will be harder to reject H0 using b(x) than using a(x). According to Tsao's analysis, however, classical statistics turns out to be less conservative than its Bayesian counterpart. In the literature, many classical/frequentist procedures are shown to be minimax-based, in the sense that they work best in the worst scenarios; they are therefore conservative. Lindley's paradox is a paradox in the sense that, given the same data and setting, the "frequentist" p-value can be substantially small (less than α) while the minimum of the "reasonable" Bayesian posterior probabilities of H0 remains large (greater than 1 − α). A couple of points about the paper are especially worth mentioning. Lindley's paradox, according to Tsao, is mathematically correct. It points out that frequentist and Bayesian procedures may lead to different conclusions. Given the same data, a frequentist using the p-value may reject H0, whereas, according to the Bayesian posterior probability argument, there is a very high probability that H0 is true. What his paper brings out is that statistical procedures derived from frequentist and Bayesian principles come with different types of guarantee. Frequentist hypothesis testing theory (in the Neyman-Pearson formulation) guarantees that the long-run frequency of Type I error will be less than α, while a Bayesian procedure minimizes the posterior expected loss: in decision theory, a Bayes estimator is a decision rule that minimizes the posterior expected value of a loss function. These two criteria are foundationally different, and they need not lead to the same procedure, nor to the same conclusion. One needs to be careful here: it is not because the goals of the two paradigms, (i) classical statistics and (ii) Bayesian statistics, are different that the conclusions are different. Tsao contends, in fact, that the goal of both paradigms in this specific case is the same: to assess which hypothesis, H0 or H1, is more likely to be correct. Typically, frequentist procedures perform well when used repeatedly in the long run, while the Bayesian procedure works well for the
current experiment if the prior agrees well with the parameter uncertainty. Readers will find that some of the issues raised in this paper overlap with issues covered both in Mayo and Spanos's paper on error statistics and in Weirich's paper on Bayesian theory, where Weirich contrasts the latter with classical statistical theory.
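Lindley's paradox is easy to exhibit numerically. The following sketch uses a standard textbook setup (our own choice of model and prior, not Tsao's): H0: θ = 0 versus H1: θ ≠ 0, with the sample mean distributed as N(θ, 1/n), prior probability 1/2 on H0, and θ ~ N(0, 1) under H1.

```python
# A numerical sketch of Lindley's paradox in a standard toy setup.
import math

def normal_pdf(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_h0(xbar, n, tau2=1.0):
    m0 = normal_pdf(xbar, 1.0 / n)         # marginal likelihood under H0
    m1 = normal_pdf(xbar, tau2 + 1.0 / n)  # marginal likelihood under H1
    return m0 / (m0 + m1)

for n in (10, 100, 10_000, 1_000_000):
    xbar = 1.96 / math.sqrt(n)   # keeps the two-sided p-value at about 0.05
    print(n, round(posterior_h0(xbar, n), 3))
# The p-value stays near 0.05 (rejection at that level), yet the posterior
# probability of H0 climbs toward 1 as n grows.
```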
7.3 On randomness
Like probability and statistical paradoxes, the concept of randomness is a favorite topic for many probability theorists, philosophers, and other scholars. The concept plays a crucial role in understanding the frequency interpretation of probability. The frequency account requires that the probability of an event be the long-run relative frequency in a sequence of repeatable events. That long-run relative frequency is an objective property that the frequency theorist can measure and sample. To estimate long-run relative frequencies, we would like to measure the relative frequency in a random sub-sequence. We have two papers, one by Deborah Bennett and the other by Abhijit Dasgupta, which discuss this elusive notion. Bennett provides an accessible introduction to the topic, detailing some of the problems associated with defining "randomness." Dasgupta provides rigorous mathematical foundations for randomness; he traces both its history and the mathematics leading to our present-day understanding of the topic. Since the topic itself, along with its mathematical history, could be of interest to our various readers, we will provide a background for both articles. We address the topic in three interrelated ways, borrowing insights from Dasgupta's paper: (i) the statement and scope of the problem, (ii) an historical outline, and (iii) the three paradigms of algorithmic randomness.
(i) Statement and scope of the problem:
We begin with two intuitive views about randomness and a problem regarding which sequences can be considered random. We also contrast this view with what Dasgupta calls the extensional "black-box view." Consider some examples of processes that seem "random" on our intuitive understanding of what counts as random: flipping a coin, turning a gambling wheel, taking a snapshot of weather data, measuring the time between two successive clicks of a Geiger counter detecting radioactive decay, or collecting the stock market index value. The outcomes of any such process are recorded as a list of symbols or numbers, technically called a sequence. Here, we assume that the list of outcomes is available to us only as a sequence of observations. The inner workings of the process are assumed to be unknown and will not concern us in any way. This is why the view is called the extensional "black-box" view. In this way, we completely "decouple" the generated sequence of outcomes from the process generating it.
The main problem can now be stated as follows: find a precise criterion to distinguish, among the collection of all possible sequences, the random ones from the non-random ones. In other words, which sequences are random and which are not? Here is a further simplification. Consider the example of repeatedly flipping a fair coin, where heads and tails are equally likely and the flips are probabilistically independent of one another. Denoting heads by 1 and tails by 0, the outcomes of this process can be represented as the collection of all binary sequences like 1010011010. . . . Let us call this probability space the binary sequences with uniform probability. Restricting our attention to this special case, our main problem becomes: find a precise criterion to distinguish the random binary sequences from the non-random ones (for the uniform probability case). It is a remarkable but technical mathematical fact that this special case is able to simulate and model other complex processes in a faithful way, so the restriction is not as severe as it may first appear. From now on, we work only with binary sequences with uniform probability.

(ii) An historical outline: Our outline consists of seven short paragraphs showing the historical development of the problem of randomness, leading to the present state of research.

1. Absolute lawlessness is impossible! The first intuitive and imprecise approach to the problem is to classify the random binary sequences as the ones whose bit-patterns obey no "law of regularity" whatsoever. Unfortunately, this intuitive approach is incorrect, since standard mathematical results (e.g., van der Waerden's theorem) show that all binary sequences satisfy certain laws of regularity. In fact, as Borel showed in 1909, random sequences, instead of being completely lawless, will satisfy a law of frequency stability.

2. 1909: Borel's strong law of large numbers. In 1909, Borel proved his strong law of large numbers, which is a law of frequency stability, and is the first significant law of randomness. It states (recall our restriction to the uniform probability case) that the probability is one that in a binary sequence the proportion of 1's among the first n terms approaches the value 1/2 in the limit as n approaches infinity (the frequency of 1's stabilizes to the value 1/2). While Borel's law of randomness is satisfied by all random binary sequences (see footnote 9 of Dasgupta's article for an explanation), it fails to capture the notion of randomness, since there are also many non-random sequences satisfying it. Thus, Borel's strong law is a necessary but not sufficient condition for randomness, and so, in particular, it is not a criterion for randomness.

3. 1919: Von Mises defines randomness using frequency stability. In 1919, Richard von Mises turned things around and gave a definition of, i.e. a criterion for, randomness in terms of frequency stability. He realized the
fundamental importance of frequency stability for defining randomness, and his definition of randomness is highly intuitive and appealing. A binary sequence, according to him, is random if and only if, after erasing any "admissible" part of the sequence, the remaining sequence still satisfies Borel's condition of frequency stability. This was considered to be the first mathematical definition of randomness. The idea of von Mises can be stated as follows: the randomness of a sequence is equivalent to its inherent unpredictability, which amounts to unbiasedness (frequency stability) in all its admissible parts. For von Mises, this also served as the foundation for the frequentist theory of probability that he was developing during this period.

4. 1940: Church brings in algorithms, forever. The definition of von Mises, while highly intuitive, was not mathematically precise, since he did not specify what the term "admissible" in his definition really means. In 1940, Church rectified the situation by bringing in the notion of algorithms to rigorously and precisely interpret the word "admissible" in von Mises' definition. This gave the first mathematically precise definition of randomness. In addition, Church's use of algorithms turned the subject of randomness permanently algorithmic, and the notion of algorithm became increasingly important and relevant in the study of randomness. However, as we will see, Church's modification of von Mises' definition has turned out to be inadequate.

5. 1965: Martin-Löf finds the first satisfactory definition of randomness. Even after Church made von Mises' definition of randomness mathematically precise, the definition remained inadequate. Ville showed that Church's modification of von Mises's definition was again only a necessary but not sufficient condition for randomness, and therefore still not a criterion for randomness. To provide a rough understanding of Martin-Löf's definition, call a property of sequences special if and only if the probability that a sequence has this property is zero. Some examples of special properties are "eventually constant", "every third term is 0", "there is no run of zeros of length five", etc. If such a special property S is effectively specified, we observe first that the terms of any sequence satisfying S will be, roughly speaking, so constrained that the sequence can't be random. Second, by definition, the probability is zero that a "random" sequence will have property S. Thus, a special property makes a sequence highly non-random, and it is impossible for a random sequence to satisfy an effectively specified special property. We can now classify all sequences into two types. A sequence is special if and only if it satisfies at least one effectively specified special property. Otherwise, it is typical. If a sequence is to be random, it can't be special. Hence, it must be typical. Thus, like the von Mises-Church definition, typicality is a necessary condition for randomness. Martin-Löf [Martin-Löf, 1965] turned this around to make typicality also a sufficient condition for randomness. A binary sequence is random according to Martin-Löf if and only if it is
typical, i.e., if and only if it does not satisfy any effectively specifiable special property. Martin-Löf's work showed that this indeed constitutes a desirable definition of randomness, one that does not exhibit the problems that plagued the von Mises-Church definition. In fact, Martin-Löf's definition has been the most satisfactory definition of randomness to date.

6. 1960s–1970s: Solomonoff, Kolmogorov, Chaitin, and others introduce the idea of complexity of finite strings. This approach, now known as Kolmogorov complexity, solved the problem of degrees of randomness in finite binary strings. The main idea is that a finite binary string is non-random if it has algorithmic descriptions that are shorter than the string itself, and it is random otherwise. More precisely, given a string x, the length K(x) of its shortest description can be used to measure its degree of randomness: the smaller the value of K(x) compared to the length of x, the less random x is. According to this idea, the randomness of a binary sequence is equivalent to its incompressibility, which means that none of its initial segments has a much shorter description (a rough computational illustration appears at the end of this subsection).

7. Martingales and an "unpredictability definition" of randomness. It is also possible to give a stronger version of the von Mises-Church definition using capital betting strategies, i.e., gambling strategies which take into account the amount bet, technically known as martingales. This has resulted in a more satisfactory definition of randomness, equivalent to Martin-Löf's definition. In what follows, when we mention randomness as unpredictability, we will mean this martingale definition.

(iii) Three paradigms of algorithmic randomness: We now discuss how three apparently distinct approaches to algorithmic randomness have merged in the end into an equivalent definition of randomness. As indicated in the brief historical outline above, there are three "paradigms" for defining randomness: 1. Unpredictability, 2. Typicality, and 3. Incompressibility. The three paradigms appear to be very different approaches to defining randomness. It is therefore a remarkable mathematical theorem that the three apparently distinct notions in fact coincide! Roughly speaking, this means that a sequence is unpredictable if and only if it is typical if and only if it is incompressible. This remarkable coincidence has led to the formulation of the so-called Martin-Löf-Chaitin thesis, which says that each (and so all) of these definitions gives the true definition of randomness. Like the Church-Turing thesis for the definition of algorithm, the Martin-Löf-Chaitin thesis is not a mathematical theorem or a
conjecture subject to proof, but rather a proposal arising from strong evidence. Algorithmic randomness thus provides a so-far unrivaled mathematical foundation for randomness. Furthermore, since the work of von Mises, Church, and Martin-Löf, algorithmic randomness has been a very active and lively area of research, providing deep philosophical insights into the notion of randomness and its finer ramifications. Dasgupta's article provides a survey of such research up to the present day. In this subsection we have so far followed Dasgupta's mathematical and historical development of the topic. The difference between Dasgupta's paper and Bennett's paper is that the former is more mathematically oriented, without being oblivious of the history of the development of the randomness of a sequence, whereas Bennett's paper is more intuitively accessible. Bennett discusses why randomness is hard to define. She makes a distinction between "relative randomness" and "absolute randomness", where relative randomness is defined relative to a set of properties. She thinks that both "relative randomness" and "absolute randomness" seem "fickle and unfair" in the short run, but must appear predictable in the long run. She argues that we need relative randomness to satisfy our notion of uncertainty. In fact, she also thinks that we need both notions of randomness, because we want relative randomness to measure up to ideal randomness as much as possible. Some special topics like "normal approximations", the "Stein phenomenon" (also known as the "Stein paradox"), and "data mining" are also crucial for understanding state-of-the-art research in statistics, computer science, and other fields.
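The incompressibility paradigm lends itself to a rough computational gloss. Kolmogorov complexity K(x) itself is uncomputable, but an off-the-shelf compressor gives a crude upper bound on description length, which is already enough to separate a patterned string from a "coin-flip" string; the following sketch is only illustrative.

```python
# Compression as a crude stand-in for (uncomputable) Kolmogorov complexity.
import random, zlib

def compression_ratio(data: bytes) -> float:
    return len(zlib.compress(data, level=9)) / len(data)

patterned = bytes([0, 1]) * 5000           # obeys an obvious short description
coin_flips = random.randbytes(10_000)      # eight "coin flips" per byte

print(compression_ratio(patterned))    # far below 1: compressible, non-random
print(compression_ratio(coin_flips))   # about 1: no shorter description found
```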
7.4 Normal approximations
The technique of approximating the distribution of a random variable by a normal (Gaussian) distribution is known as a normal approximation. It is the central limit theorem that justifies the use of normal approximations in commonly encountered settings. Boik describes the theory and application of the central limit theorem (CLT). According to the CLT, if certain mild regularity conditions are satisfied, then the distribution of a suitably standardized sum or mean of a sequence of random variables approaches a normal distribution as the number of random variables in the sequence increases. For example, if Y1, Y2, . . . , Yn is a random sample from an infinite-sized population whose mean and standard deviation are µ and σ respectively, then the distribution of √n(Ȳ − µ)/σ converges to a normal distribution as n increases to infinity, where Ȳ is the mean of the random sample. Boik considers how the scope of the CLT can be expanded to include non-linear functions of means and sums by the delta method and Slutsky's theorem. He also describes and illustrates several applications of the CLT. These applications include approximating the distributions of sums or means of discrete random variables; approximating the sampling distributions of sums or means of random samples from finite-sized populations; approximating the sampling distributions
of test statistics in permutation tests; and many more. One highlight of his paper is a discussion of a result that justifies multivariate normal approximations for Bayesian posterior distributions of parameters. In addition, he describes how the accuracy of normal approximations can be improved by making small adjustments to the approximations.
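The CLT is easy to see in simulation. The following sketch (our own toy illustration, not Boik's) standardizes means of samples from a skewed exponential population, which has µ = σ = 1, and checks that the resulting draws behave roughly like a standard normal.

```python
# A simulation sketch of the CLT with a skewed (exponential) population.
import random, statistics

def standardized_mean(n, mu=1.0, sigma=1.0):
    # Exponential(1) has mean 1 and standard deviation 1.
    sample = [random.expovariate(1.0) for _ in range(n)]
    return (statistics.mean(sample) - mu) * n ** 0.5 / sigma

draws = [standardized_mean(50) for _ in range(10_000)]
print(round(statistics.mean(draws), 3))            # close to 0
print(round(statistics.stdev(draws), 3))           # close to 1
print(sum(d < 1.645 for d in draws) / len(draws))  # close to the normal 0.95
```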
7.5 Stein phenomenon
Richard Charnigo and Cidambi Srinivasan have written on the topic of the Stein phenomenon. Since there is intense interest in the problem among philosophers, we will discuss this topic to make it more accessible to our general readers [Sober, 2008]. We first consider some elementary ideas from statistics to see how the "Stein phenomenon" is startling from the perspective of the relation between the sample mean and the population mean. The sample mean is widely regarded as a good estimate of the population mean, since its expected value is the same as the population mean (it is an unbiased estimator) and it has the least variation around the population value compared to any other unbiased estimator under normal conditions. If we assume the data to be normally distributed with some unknown mean and variance, then the sample mean is, in fact, the maximum likelihood estimate (MLE) of the population mean. For a given data set and an underlying probability model, the MLE picks the values of the model parameters that make the data "most likely." In sum, the sample mean provides a natural way of "extracting information" about the population mean from the data. The sample mean is assumed to be optimal with regard to estimating the population mean because no other estimator has all these properties. Several optimality properties of the sample mean have subsequently been proved, providing a sound foundation for statistical methodology. These features of the sample mean led to the belief that when one would like to estimate several population means simultaneously, one should use the sample means, as they are likely to be optimal. Charles Stein showed that this belief is unfounded. In fact, when three or more parameters are estimated simultaneously, there is a combined estimator that is more accurate (in the sense of minimizing the total expected squared error) than the standard estimator that treats the parameters separately. This is known as the "Stein phenomenon." Suppose we are running doughnut shops in the state of New York. Our intention is to know how many doughnuts we should expect to sell this year (2010) based on how many we sold last year in each of our eight stores (S1, S2, S3, . . . , S8). Let us assume further that the stores are positioned geographically so that sales in each store may be regarded as statistically independent of the sales of the other stores.13 We would like to estimate the parameter µ1 = the expected

13 To make this example more intuitive, we could assume that we sell doughnuts in one store and completely unrelated items in the other stores: in the second store we sell shoes, in the third store insurance, and so on. With this setup, the discussion in this subsection would change correspondingly; for example, we would now write µ2 = the expected sale of shoes in store two in 2010 based on the sample X2 = the observed sale of shoes per 10,000 people in 2009, and so on both for the other six parameters and for the observed sales of the other unrelated items in 2009. Since this would make the discussion more complex, we have worked with the doughnut store example as stated above, with some oversimplification.
sale of doughnuts in store one in 2010, assuming that each month consists of 30 days, based on the sample X1 = the observed sale of doughnuts in 2009. Similarly, µ2 = the expected sale of doughnuts in store two in 2010 based on the sample X2 = the observed sale of doughnuts per 10,000 people in 2009, and so on both for the other six parameters and for the observed sales of doughnuts in 2009. The "Stein phenomenon" shows that if we are interested in estimating µ1 through µ8 simultaneously, the best guesses are not X1 through X8 respectively. Rather, we get a better estimate of the eight parameters from a formula that makes use of the other data points than from any measure that estimates the parameters separately. Consider taking the MLE of each population parameter based on each sample mean separately. Let x be the n-tuple of the observed values of X1, X2, . . . , Xn. Whereas x is the MLE of the n-tuple of population means, the Stein estimator is

(1 − (n − 2)σ²/|x|²) x,

where σ² is the common variance of each random variable, which we assume to be known. The Stein "correction" has the effect of shifting all the values towards zero. Compare these two estimates. In the Stein case, we get a better estimate of the eight parameters than in the first case, where "better" means that the sum of the expected squared errors is lower. In fact, Stein proved that the Stein estimator dominates the MLE when n is three or more, although for any single estimate the MLE could perform better. In addition, for a specific estimate of a single parameter, the MLE and Stein estimates could be comparable. The fact that the Stein estimator dominates the MLE and its cognates shows that the latter are suboptimal. The MLE and the like are not optimal for estimating three or more unrelated parameters simultaneously. Stein estimation has a seemingly paradoxical nature; as a result, it is also called "Stein's paradox." When estimating the expected doughnut sale in store 1 in 2010 (i.e., µ1), we should use the observations of the other stores even though their sales are statistically independent of the doughnut sales in store 1. It seems paradoxical that the estimate for µ1 should depend on X2 or X3, since they are statistically independent of X1. We need to be careful about the exact import of the Stein estimator. If we are only interested in minimizing the expected squared error for the sales in store 1, then there is no advantage in using the other variables. It is the sum of the expected squared errors that is made reliably smaller by Stein's estimator, not the expected squared errors individually. One way to provide an intuitive understanding of the Stein phenomenon, and thereby to take away some of the paradoxical air from "Stein's paradox," is to think of the eight parameters being estimated simultaneously as randomly generated from a single distribution with one common mean, π, in terms of the expected annual sale in 2010 per 10,000 people over the eight stores. Then, even though every
observation X1, X2, X3, . . . , X8 contains information about the common mean π, and X1 contains the most information about µ1, the observations X2, X3, X4, . . . , X8 also contain some indirect information about µ1. Whatever one thinks about whether this intuitive clarification really takes away the paradoxical air of the Stein phenomenon, one should remember that Stein's result is a mathematical result, holding independently of whether any intuitive explanation of it makes sense. Both Charnigo and Srinivasan have made this point very clear in their chapter. To return to the theme of the paradoxical nature of the Stein phenomenon, Robbins [1951] wrote, "X1 could be an observation on a butterfly in Ecuador, X2 on an oyster in Maryland, X3 the temperature of a star, and so on." The seeming weirdness of the Stein phenomenon emerges when we ask: to estimate the number of butterflies in Ecuador, should one jointly estimate the number of butterflies in Ecuador and of oysters in Maryland? If what we want to estimate is only the number of butterflies, then we should estimate just that. But if we are concerned with both the butterflies in Ecuador and the oysters in Maryland, then we should get an estimate of both, and for the latter purpose estimating the two quantities jointly is quite reasonable, since the Stein phenomenon allows for a reduction in the total expected error.
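The dominance claim can be checked by simulation. The sketch below uses eight arbitrary "true" means with known variance σ² = 1 and the positive-part variant of the Stein estimator (a standard refinement, used here for simplicity); the particular numbers are made up.

```python
# A simulation sketch of the Stein phenomenon (known variance sigma^2 = 1).
import random

mu = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, -0.5]   # eight unrelated parameters
n, trials = len(mu), 20_000
mle_risk = stein_risk = 0.0
for _ in range(trials):
    x = [random.gauss(m, 1.0) for m in mu]         # one observation per store
    shrink = max(0.0, 1.0 - (n - 2) / sum(xi * xi for xi in x))
    stein = [shrink * xi for xi in x]              # shrinks all values toward 0
    mle_risk += sum((xi - m) ** 2 for xi, m in zip(x, mu))
    stein_risk += sum((si - m) ** 2 for si, m in zip(stein, mu))

print(mle_risk / trials)     # close to 8, the total risk of the sample means
print(stein_risk / trials)   # reliably smaller: the sum of squared errors drops
```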
7.6 Data-mining
In the twenty-first century, we cannot afford to live without the use of data in some form or other, whether the data involve our credit card information, information about car theft, or information about our genotype. How investigators can make effective use of data to extract the right sort of information from them is a daunting task. Although traditional statistical tools are available, today we rely increasingly on computer applications for processing information from data. Choh Man Teng in her chapter gives an overview of the areas where data mining is at work. Knowledge discovery is one such area, extracting information from a huge body of data. The information that can be discovered in these areas need not be mutually exclusive, and the areas can be understood as performing different tasks, namely (i) description, (ii) prediction, and (iii) explanation. Descriptive tasks consist of describing a large amount of data succinctly; this task does not involve uncertainty. The prediction task involves finding a mechanism that will reliably foretell the value of a target feature of an unseen instance based on information about some known features. Unlike the descriptive task, it makes an inference that involves uncertainty, and predictive accuracy is a central concept. The explanation task aims at discovering the underlying mechanism that generates the data, thus providing an explanation for the generation of this set of data rather than another. To retrieve or construct a model from a given data set, investigators sometimes face both under-fitting and over-fitting problems, so that model construction involves a trade-off between bias and variance. Many statistical tools and analyses are applied to resolve or minimize this problem.
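The bias-variance trade-off can be illustrated with a toy curve-fitting exercise; the target function, noise level, and polynomial degrees below are all made-up choices.

```python
# A toy sketch of the under-fitting / over-fitting trade-off.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 12)
truth = np.sin(2 * np.pi * x)
y_train = truth + rng.normal(0.0, 0.2, x.size)
y_test = truth + rng.normal(0.0, 0.2, x.size)   # fresh noise, same design

for degree in (1, 3, 9):   # under-fit, reasonable fit, over-fit
    coeffs = np.polyfit(x, y_train, degree)
    test_mse = float(np.mean((np.polyval(coeffs, x) - y_test) ** 2))
    print(degree, round(test_mse, 3))
# Degree 1 misses the structure (high bias); degree 9 chases the training
# noise (high variance); an intermediate degree typically does best.
```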
Interestingly, the reference class problem, well known to philosophers of probability, also arises in data mining. Consider an example. In one dorm, all the students except Mary have watched a certain rental movie, and 90% of them disliked it. In another dorm, all the residents are French majors, and 95% of them liked the film very much. Mary is a French major, but she lives in the first dorm and has yet to watch the movie. She has no access to any other information on which to base an estimate of how much she will like the movie. To which reference class should Mary be assigned? Teng discusses some measures for handling the reference class problem. Besides the reference class problem, she also discusses how we sometimes fall back on experts both to ensure correct analysis and to check whether procedures are properly executed. She also considers the advantages and disadvantages of replacing experts or statisticians by automatic, computer-guided data analysis.
8 AN APPLICATION OF STATISTICS TO CLIMATE SCIENCE
A great deal of statistical research has been applied to issues that confront modern society. The topics range from determining the causes of the cholera epidemic in London in 1854, to finding the cause of arsenic poisoning in Bangladesh, to the currently hotly debated topic of climate change. Mark Greenwood, a statistician, teamed with Joel Harper and Johnnie Moore, two geo-scientists, to write a paper on the application of statistics to climate change, focusing on finding evidence of climate change in a specific area. They provide an application of statistical modeling to climate change research. They assess the evidence of climate change in the timing of northern Rocky Mountain stream-flows, which is one way of measuring potentially earlier snowmelt caused by a warming climate. A nonparametric spatial-temporal model is used to assess evidence for change and then to estimate the amount of change. The methods illustrate the estimation of non-linear components in complicated models that account for spatial-temporal correlations in measurements. They find evidence for the presence of climate change by using a model selection criterion, the AIC (see section 2.5 for more on the AIC framework). In this paper they apply the AIC first under the assumption that the data have a linear trend, and then under the assumption that the trend is non-linear. Strikingly, the non-linear trend agrees with the historical data much better, even though the non-linear model uses many more degrees of freedom than the model that assumes the change has been linear. They compare results for models that include the non-linear trend; they then constrain the trend to be linear, and later remove it entirely from the model. Using the AIC to assess evidence of climate change is a new idea, because much of the literature on climate change has focused on hypothesis testing methods. In contrast to much of the other research on stream-flow timing measures in the western US, their methods provide a regional-level estimate of the common climate change signal. They find the trend to be more complicated than a linear trend.
Thus, they conclude, the impacts on the timing of stream-flow in this region are not linear over the time period. In fact, their findings imply that the magnitude of the change is lower than has previously been suggested in other research. They propose methods that have the potential for application in other areas of climate change research where a priori assumptions of linear change in systems over time may be suspect and where the data sets are collected over space as well as time.
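The logic of the AIC comparison can be conveyed with a schematic example; the trend, noise, and models below are made up, and the authors' actual analysis uses a far more elaborate nonparametric spatial-temporal model.

```python
# A schematic AIC comparison of a linear vs. a more flexible trend model.
import numpy as np

def aic(y, fitted, k):
    # Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k.
    rss = float(np.sum((y - fitted) ** 2))
    return y.size * np.log(rss / y.size) + 2 * k

rng = np.random.default_rng(1)
t = np.arange(60.0)   # sixty years of a (made-up) timing measure
y = 150.0 - 0.004 * (t - 20.0) ** 2 + rng.normal(0.0, 1.0, t.size)

for degree in (1, 3):   # linear trend vs. a more flexible polynomial trend
    coeffs = np.polyfit(t, y, degree)
    print(degree, round(aic(y, np.polyval(coeffs, t), degree + 2), 1))
# Extra parameters are penalized by 2k, so the flexible trend wins only if it
# reduces the residual sum of squares enough, as it typically does here.
```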
9 HISTORICAL TOPICS IN PROBABILITY AND STATISTICS
The final two papers of the volume are of a historical nature. One examines how the subjective and objective notions of probability evolved and developed in the western world over a period of three centuries, culminating in the middle of the nineteenth century. The other looks at how the notion of probability was used in Ancient India, before probability made inroads in the western world. Let us begin with Sandy Zabell's paper on the first topic. The use of subjective and objective probabilities is currently a great source of debate among different statistical paradigms. Bayesians are often pitted against non-Bayesians, since the former are taken to invoke subjective priors whereas the latter eschew them. Zabell looks closely at the historical development of subjective and objective probabilities in the western world, dating back to the seventeenth century. He contends that, although there was no clear-cut distinction between subjective and objective probabilistic reasoning as early as the seventeenth and eighteenth centuries, a clear-cut distinction between the two emerged much later, in the middle of the nineteenth century. Looking at the original texts of the giants of probability theory (the Bernoulli brothers, Cournot, Poisson, Laplace, De Morgan, and D'Alembert, as well as Mill, Venn, and others), he sets the debate in the context of well-known philosophers of that era like Voltaire, Leibniz, Kant, and Hume. Thus, Zabell is able to create a perspective from which one can see the historical development of the distinction between the subjective and the objective. The paper contains several historical and philosophical comments that many modern probability theorists might find fascinating. He discusses how the meanings of words like "subjectivity" underwent a sea-change, as the expression "subjective" was used to mean "objective" during the sixteenth century; Descartes' Meditation III is a case in point. He also discusses the influence of Kant on thinking about probability, although probability statements are not included in Kant's list of judgments. His interest does not overlook little-known figures like Ellis. He discusses how Venn's The Logic of Chance, although borrowing most of its ideas from Ellis' work, makes only one passing reference to it. Some of Zabell's most interesting comments occur when, discussing Bertrand's work, he distinguishes three senses of epistemic probability. These three senses are:

1. an epistemic probability could be subjective in the first sense when it is an attribute of the subject perceiving rather than the object perceived.
2. an epistemic probability could be subjective in the second sense because it can vary from one individual to another, as different individuals might possess different information.

3. an epistemic probability could be subjective in the third sense because it could vary from one individual to another even if they possess the same information.

He attributes the first sense to Cournot and Venn and the second to Bertrand. The third sense, he thinks, goes to the heart of the debate between the objective and subjective probability theorists of the 20th century (see section 2.3 for more on this debate). C. K. Raju investigates the notion of probability, and to some extent mathematics, in Ancient India. He discusses how the notions of permutation and combination were connected with poetic meters and allied themes in the Chandahsutra, before the 3rd century CE. He explains how the ideas of permutation and combination were actually applied to the Vedic meter, in the earliest known written accounts relating to permutations and combinations, dating back even earlier than the third century CE. He discusses how the game of dice is central to understanding probabilistic reasoning in Ancient India, since there are numerous stories from Ancient India in which the protagonists were addicted gamblers. Raju goes back to a claim made by Ian Hacking, who, crediting the Indian statistician Godambe, holds that the notion of sampling is first found in the Mahabharata, one of the two epics of Indian culture, written more than 2000 years ago [Hacking, 1975]. Raju discusses the story behind Hacking's attribution of the idea of sampling to the Mahabharata. In the Nala-Damayanti episode, King Rituparna told Nala, his charioteer, exactly how many nuts the entire tree had, by sampling just one branch of the tree. Raju also suggests, by reference to the Nala-Rituparna dialogue, why we will not find numerous references to the art of probabilistic reasoning in Ancient India. In this dialogue, Nala asked the king how he could tell the exact number of nuts in the tree without counting them individually. Rituparna said "this knowledge is secret," implying that it should seldom be shared with others. This attitude is as old as India herself, and it is one of the reasons why the West tends to call much of ancient Indian philosophy "mystical" and "irrational." Raju shows that a philosophy of mathematics which he calls "zeroism" can resolve a long-standing difficulty with the frequentist interpretation of probability, namely that relative frequency converges to probability only in a probabilistic sense. Further, he explores the relation of both Buddhist and Jaina logic to probabilities in quantum mechanics, since all of them (Buddhist logic, Jaina logic, and the probabilities of quantum mechanics) are non-truth-functional. A logic is non-truth-functional if and only if the truth value of a compound sentence is not a function of the truth values of its constituent sentences. Raju discusses how the distributive law connecting the truth-functional operators "and" and "or" fails in quantum mechanics. In a double-slit experiment, to say that 'the electron reached
the screen and passed through slit A or slit B' is not the same thing as saying that 'the electron reached the screen and passed through slit A, or the electron reached the screen and passed through slit B.' In one case, the author contends, we get a diffraction pattern, whereas in the other we get a superimposition of two normal distributions. Exploiting this result in quantum mechanics, Raju explains in what sense Buddhist and Jaina logic are quantum mechanical.
10 PLANS BEHIND ARRANGING MAJOR SECTIONS/CHAPTERS
Having discussed all the papers in our volume briefly, it is worthwhile to give the reader some idea of why we divided the entire volume into fourteen major divisions. Since many of the chapters address key issues in the philosophy of statistics, those issues and topics lend themselves to technicalities and intricacies in probability and statistics. To bring our readers up to speed, so to speak, for the various papers in the volume, Prasanta Bandyopadhyay and Steve Cherry have written a primer on probability and statistics that presupposes no technical knowledge. The notion of conditional probability is a central concept for probability and statistics. However, it has raised several philosophical problems regarding its correct interpretation. As already noted, we have a section headed "Philosophical Controversies about Conditional Probability," which contains two papers: (i) Hájek's paper on "Conditional Probability," followed by (ii) Easwaran's paper on "The Varieties of Conditional Probability," which explain the concept in detail. Given this preparation on the fundamentals of probability and statistics, we then introduce four influential statistical paradigms: (1) classical/error statistics, (2) Bayesianism, (3) likelihoodism, and finally (4) the Akaikean framework. Each section dealing with the four paradigms usually consists of more than one paper. The classical/error statistics section consists of two papers: (i) Mayo and Spanos' paper on "error statistics," followed by (ii) Dickson and Baird's paper on "Significance Testing." Our largest section is on the Bayesian paradigm. We divide that section under several sub-headings. Under the subsection on "Subjectivism," we have Weirich's paper on the "Bayesian Decision-Theoretic Approach." The subsection on "Objective Bayesianism" consists of two papers: (i) Bernardo's paper on "Modern Bayesian Inference: Foundations and Objective Methods" and (ii) Wheeler and Williamson's paper on "Evidential Probability and Objective Bayesian Epistemology." The next subsection, on "Confirmation Theory and its Challenges," consists of two papers: (i) Hawthorne's "Confirmation Theory," followed by (ii) Norton's paper "Challenges to Bayesian Confirmation Theory." The subsection on Bayesianism as a form of "logic" consists of (i) Howson's paper "Bayesianism as a Pure Logic of Inference" and (ii) Festa's paper on "Bayesian Inductive Logic, Verisimilitude and Statistics." The next paradigm is the likelihood framework. Two papers, (i) Blume's paper on "Likelihood and its Evidential Framework" and (ii) Taper and Lele's paper on "Evidence, Evidence Functions, and Error Probabilities," belong to this section. The section on the Akaikean framework has just one paper, Forster and Sober's paper on "AIC Scores
as Evidence: A Bayesian Interpretation." The next section consists of a single paper by Grossman on "The Likelihood Principle." This is followed by the section called "Recent Advances in Model Selection," which contains two papers: (i) Chakrabarti and Ghosh's paper on "The AIC, BIC and Recent Advances in Model Selection" and (ii) Dawid's paper on "Posterior Model Probabilities." Section six, titled "Attempts to Understand Aspects of Randomness," has two papers. The first is by Deborah Bennett on "Defining Randomness" and the second is by Abhijit Dasgupta on "Mathematical Foundations of Randomness." Our section on "Probabilistic and Statistical Paradoxes" consists of two papers. One is by Vineberg on "Paradoxes of Probability" and the other is by Tsao on "Statistical Paradoxes: Take It to the Limit." We have a single article by Romeijn under the section heading "Statistics as Inductive Inference." The section on "Various Issues about Causal Inference" has two papers. One is by Spirtes on "Common Cause in Causal Inference" and the other is by Greenland on "The Logic and Philosophy of Causal Inference: A Statistical Perspective." We have two papers in the section on "Some Philosophical Issues Concerning Statistical Learning Theory." The first paper is by Harman and Kulkarni on "Statistical Learning Theory as a Framework for Philosophy of Induction" and the second is by Steel on "Testability and Statistical Learning Theory." In the next section we have brought three papers together under the heading "Different Approaches to Simplicity Related to Inference and Truth." They are (i) De Rooij and Grünwald's paper on "Luckiness and Regret in Minimum Description Length," (ii) Dowe's paper on "MML, Hybrid Bayesian Network Graphical Models, Statistical Consistency, Invariance, and Uniqueness," and (iii) Kelly's paper on "Simplicity, Truth, and Probability." The section on "Special Problems in Statistics/Computer Science" includes three papers: (i) Boik's paper on "Normal Approximations," (ii) Charnigo and Srinivasan's paper on the "Stein Phenomenon," and finally (iii) Teng's paper on "Data, Data, Everywhere: Statistical Issues in Data Mining." Greenwood, Harper, and Moore's paper on "Applications of Statistics in Climate Change: Detection of Non-Linear Changes in a Stream-flow Timing Measure in the Columbia and Missouri Headwaters" constitutes the single contribution in the section on "A Statistical Application to Climate Change." Our last section, on "Historical Approaches to Probability and Statistics," consists of two papers: (i) Zabell's paper on "The Subjective and the Objective" and (ii) Raju's paper on "Probability in Ancient India."
CODA

We began our journey with the claim that both philosophers and statisticians are, in a sense, interested in the same problem, i.e., "the problem of induction". (This coda was prompted by an observation of Jayanta Ghosh.) The papers discussed in this Introduction explain at some length how they approach the problems and issues pertaining to statistical inference, broadly construed.
However, the time has come to pause and ask what, or whether, we have learned about the "philosophy of statistics" from these papers. Readers have no doubt noticed disagreements among the various authors regarding the proper approach to statistical problems. Observing such disagreement, the British philosopher George Berkeley might have commented that "[they] have first raised a dust and then complain that [they] cannot see" [Berkeley, 1970, 1710]. This time, however, "they" includes both philosophers and non-philosophers. We mentioned at the beginning that the philosophy of statistics is concerned with foundational questions in statistics. In fact, the debate in the philosophy of statistics is, in some sense, comparable to the debate in the foundations of mathematics that began in the early part of the twentieth century. It is well known that Frege, working on the foundations of arithmetic, assumed set-theoretical axioms that Russell later showed to be inconsistent. This was followed by Gödel's incompleteness results, which shook the very foundations of mathematics. It is true that there has been no comparable result in the foundations of statistics. Yet there is a similarity between the two fields in the debates about their foundations and in their subsequent development toward real-world applications. Although foundational questions remain unresolved in mathematics, mathematics has been used and applied extensively, in the development of Newtonian physics, relativistic theories, and quantum mechanics, and in the discovery of the double-helical structure of DNA, applications that make stunning predictions. The point is that foundational debates have not stopped mathematics from being exploited to solve real-world problems. In the same vein, even though there are debates about the correct foundation of statistics, statistics has an applied side in which statistical tools and software are widely used in diagnostic studies, weather forecasting, estimating the prospects for life in and beyond our solar system, and so on. The question remains: what, then, is the philosophy of statistics that is taught by our esteemed authors? Without waiting to see whether the dust has fully settled on the foundational issues, one can venture to address this pressing question. To a sympathetic reader, the philosophy of statistics will appear to be a mosaic of themes consisting of several seemingly unrelated features. Different statistical schools contribute different aspects of our understanding of its underlying themes. Error statistics provides us with tools for handling errors in testing hypotheses while making us aware of the key role "errors" play in scientific investigations. The likelihood framework shows why the concept of evidence is so crucial in science and consequently needs to be separated from the concept of belief. The Akaikean framework is interested in the concept of prediction and provides a way of capturing this concept to some extent. In contrast, Bayesianism is an over-arching school which covers many of the concepts just mentioned, provided one is allowed to have prior information about the topics in question. In this mosaic of themes, computer scientists and mathematicians have not lagged behind. They also help us understand the uncertainty involved in carrying out inductive
inference. Similarly, applied statisticians have helped us to appreciate another side of this problem. This applied side of statistics, which aims at reliable inference, causal and non-causal, using new algorithms and causal diagrams, has begun to make breakthroughs in wider contexts, such as diagnostic studies. Two historical papers also have their place in this mosaic. One traces the debate about the subjective/objective notions of probability back to seventeenth-century Western intellectual history. The other contends that we cannot afford to be ethnocentric about some probabilistic/statistical notions, since they can be found in other traditions well before the emergence of probability and statistics in the Western tradition. Much more research needs to be conducted on the emergence of probability in non-Western traditions. The philosophy of statistics thus has a huge canvas that includes, among other topics, these theoretical and practical sides, which have so far been treated as two different aspects with little hope of converging. Hopefully, over the next quarter of a century, we will see much more interaction between them, so that the next generation of this series can keep us abreast of this symbiosis in addition to the raging foundational debates that will remain a staple for philosophers for many years to come.

ACKNOWLEDGEMENTS

We have received help from numerous people while writing this introduction. Individuals who have offered their expertise are Deborah Bennett, Jeffrey Blume, Robert Boik, Richard Charnigo, Arijit Chakrabarti, Abhijit Dasgupta, Jason Davey, Philip Dawid, Steven de Rooij, David Dowe, Roberto Festa, Mark Greenwood, Peter Grünwald, Alan Hájek, Gilbert Harman, James Hawthorne, Colin Howson, Kevin Kelly, Deborah Mayo, Sanjeev Kulkarni, John Norton, C. K. Raju, Jan-Willem Romeijn, Peter Spirtes, Cidambi Srinivasan, Mark Taper, Choh Man Teng, C. Andy Tsao, Susan Vineberg, Gregory Wheeler, Paul Weirich, Jon Williamson, and John Woods. We are thankful to Jane Spurr for her willingness to incorporate even very minor changes numerous times while the introduction was being written. We are especially indebted to Gordon Brittan, Jayanta Ghosh, John G. Bennett, Sander Greenland, and Billy Smith for their substantive suggestions regarding the entire manuscript, whether by helping us sharpen various philosophical/statistical arguments or by improving our writing style. Without the help of all these individuals, this would have been a much worse paper. We are, however, solely responsible for any errors that it might still contain. PSB's research has been supported by NASA's Astrobiology Research Center grant (#4w1781).
BIBLIOGRAPHY

[Berger, 1985] J. Berger. Statistical Decision Theory and Bayesian Analysis, 2nd edn. Berlin: Springer-Verlag, 1985.
[Berkeley, 1710] G. Berkeley. A Treatise Concerning the Principles of Human Knowledge, 1710. Edition used: C. M. Turbayne, ed., Indianapolis: Bobbs-Merrill, 1970.
[Bernardo, 2005] J. Bernardo. Reference Analysis. In D. K. Dey and C. R. Rao, eds., Handbook of Statistics, vol. 25, pp. 17-90. Elsevier, North-Holland, 2005.
[Bernardo, 1997] J. Bernardo. Noninformative Priors Do Not Exist: A Dialogue with José Bernardo. Journal of Statistical Planning and Inference, 65, pp. 157-189, 1997.
[Bernardo and Smith, 1994] J. Bernardo and A. Smith. Bayesian Theory. New York: John Wiley, 1994.
[De Finetti, 1937] B. De Finetti. La prévision: ses lois logiques, ses sources subjectives, 1937. Translated in H. E. Kyburg, Jr. and H. E. Smokler, eds., Studies in Subjective Probability, pp. 93-158. New York: Wiley, 1964.
[Fisher, 1973] R. Fisher. Statistical Methods and Scientific Inference, 3rd edn. New York: Wiley, 1973.
[Forster and Sober, 1994] M. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions. British Journal for the Philosophy of Science, 45, pp. 1-36, 1994.
[Greenland et al., 1999] S. Greenland, J. M. Robins, and J. Pearl. Confounding and Collapsibility in Causal Inference. Statistical Science, 14, pp. 29-46, 1999.
[Grünwald, 2007] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[Hacking, 1975] I. Hacking. The Emergence of Probability. Cambridge University Press, UK, 1975.
[Howson and Urbach, 2006] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach, 3rd edn. Open Court, Illinois, 2006.
[Hume, 1739] D. Hume. A Treatise of Human Nature. London, 1739. Edition used: P. H. Nidditch, ed., Oxford: The Clarendon Press, 1978.
[Joyce, 1999] J. Joyce. The Foundations of Causal Decision Theory. Cambridge University Press, UK, 1999.
[Lele, 2004] S. Lele. Evidence Functions and the Optimality of the Law of Likelihood. In M. Taper and S. Lele, eds., The Nature of Scientific Evidence. University of Chicago Press, Chicago, 2004.
[Lewis and Shelby-Richardson, 1966] D. Lewis and J. Shelby-Richardson. Scriven on Human Unpredictability. Philosophical Studies, 17(5), pp. 69-74, 1966.
[Martin-Löf, 1966] P. Martin-Löf. The Definition of Random Sequences. Information and Control, 9, pp. 602-619, 1966.
[Neyman, 1967] J. Neyman. A Selection of Early Papers by J. Neyman. Berkeley: University of California Press, 1967.
[Pagano and Gauvreau, 2000] M. Pagano and K. Gauvreau. Principles of Biostatistics, 2nd edn. Duxbury, Australia, 2000.
[Robbins, 1951] H. Robbins. Asymptotically Subminimax Solutions of Compound Statistical Decision Problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 131-148. University of California Press, 1951.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London, 1997.
[Savage, 1972] L. Savage. The Foundations of Statistics. Dover, New York, 1972.
[Scriven, 1965] M. Scriven. An Essential Unpredictability in Human Behavior. In B. B. Wolman and E. Nagel, eds., Scientific Psychology: Principles and Approaches, pp. 411-425. Basic Books (Perseus Books), 1965.
[Seidenfeld, 1979] T. Seidenfeld. Philosophical Problems of Statistical Inference: Learning from R. A. Fisher. Dordrecht, Boston: D. Reidel Publishing Company, 1979.
[Skyrms, 1984] B. Skyrms. Learning from Experience. In Pragmatics and Empiricism, pp. 37-82. Yale University Press, New Haven, 1984.
[Sober, 2008] E. Sober. Evidence and Evolution. Cambridge University Press, UK, 2008.
[Taleb, 2010] N. Taleb. The Black Swan: The Impact of the Highly Improbable, 2nd edn. Random House, New York, 2010.
[Vapnik, 1998] V. Vapnik. Statistical Learning Theory. New York: John Wiley, 1998.
[Weirich, 2004] P. Weirich. Realistic Decision Theory. Oxford University Press, 2004.
[Zabell, 2005] S. Zabell. Symmetry and its Discontents: Essays on the History of Inductive Probability. Cambridge University Press, UK, 2005.
Part I
Probability & Statistics
ELEMENTARY PROBABILITY AND STATISTICS: A PRIMER
Prasanta S. Bandyopadhyay and Steve Cherry
1 INTRODUCTION
Some of the chapters of this volume, albeit thorough, are technical and require some familiarity with probabilistic and statistical reasoning. We believe that an introduction to some of the basic concepts of probability and statistics will be helpful for our general reader. Probability and statistics provide the necessary tools to capture the uncertain state of our knowledge, and they also provide the tool-kits needed to quantify our uncertainties. Our discussion will be mostly quantitative. Patient readers who work through some of those technicalities will be rewarded in the end by seeing how the technicalities contribute to a better appreciation of several of the papers, which on many occasions engage philosophical issues of great importance. Below we provide a brief introduction to the theory of probability and statistics. The introduction is at the level found in introductory statistics classes and discussed in many statistical textbooks. Typical texts used over the past few years at Montana State University include Devore [2008], Moore and McCabe [2006], and Deveaux et al. [2008]. Much of the material and many of the examples presented below are taken from Devore. We begin our discussion of the nature of probabilistic/statistical reasoning by contrasting probabilistic/statistical reasoning/inference with deductive reasoning/inference. The property of deductive validity of an argument is central to understanding the distinction between deductive inference and inductive inference, where deductive validity can be understood in terms of the monotonicity property of reasoning. First, we define "deductive validity"; then we explain the property of monotonicity. An argument is deductively valid if and only if it is logically impossible for its premises to be true but its conclusion false. A sentence is a deductive consequence of others when it is logically impossible that they should be true but that sentence false. Consider the set S1, which consists of the sentence p and the sentence p → q, where "→", known as the "material conditional," captures "if p then q" sentences. Here, q is a deductive consequence of S1, which we write as "S1 ⊢ q." The symbol "⊢" denotes the deductive consequence relation, which is a relation between a set of sentences and
a sentence. Monotonicity is a property of certain types of inference and is appreciated in terms of the deductive consequence relation. A relation between sets of sentences and a sentence is monotonic if and only if, when it holds between a set and a sentence, it also holds between any superset of the set and that sentence. For example, S2 can be taken to be a superset of S1 consisting of S1 together with ∼p, where the symbol "∼" means "it is false that." Any deductive consequence relation, represented by "⊢", is by definition monotonic: if S2 is any superset of the set S1 above (for which we have S1 ⊢ q), then S2 ⊢ q must hold too. To appreciate this point, consider the rule of modus ponens, which is a well-known rule of deductive logic.

1. If it rains, then the ground will be wet.
2. It has rained.
Therefore, the ground will be wet.

"p", which is called the "antecedent" of the if-then sentence, represents the proposition "it rains." "q", which is called the "consequent" of the if-then sentence, represents the proposition "the ground will be wet." The standard rules of deductive logic, including modus ponens, imply that adding new premises to a store of information can only increase the class of conclusions that can be deduced from them. If one adds, for example, ∼p to the above set comprising premise 1 and premise 2, then one would still be able to deduce q, along with additional conclusions, from the set {∼p, p, p → q}, although the entire set would turn out to be inconsistent. Non-monotonic reasoning, which underlies much of our probabilistic/statistical reasoning, by contrast allows the possibility that adding new information can actually result in dispensing with conclusions previously held. Many of our daily experiences are characteristic of non-monotonic reasoning, in which we lose information in light of new evidence. Consider the following example.

1. All observed crows are black.
2. Therefore, all crows are black.

Suppose we have observed so far that all crows are black. Seeing a picture of an albino crow in the local newspaper (and assuming we have no reason to be skeptical about the picture), our belief that all crows are black is undermined. So an addition of information to our belief set could force us to lose some of our cherished beliefs. Here, the new premise about the presence of an albino crow has led to the rejection of the conclusion that all crows are black. This loss of information is a feature solely of non-monotonic reasoning, which underlies much of probabilistic/statistical reasoning. Although we have introduced the ideas of non-monotonic and monotonic reasoning intuitively, in terms of whether we lose our existing information in one case and not in the other, we need to be aware that it is deductive validity, which cannot be undermined by adding a new premise, that goes to the heart of monotonic reasoning. The purpose of our examples
is just to illustrate this theme behind monotonic reasoning. Probability theory provides a better tool for handling inductive arguments, the vehicle through which non-monotonic reasoning has primarily been expressed. We will begin with probability theory. In the final section of this chapter, after learning about probability theory and statistics, we will return to the theme of non-monotonic reasoning to evaluate whether arguments used in statistical/probabilistic inference necessarily involve non-monotonic reasoning.
2.1 Basic and derived rules of the probability calculus
Probability theory provides a tool for handling inductive arguments. Typically the first step is to define some needed terms. We define a probabilistic experiment (broadly) to be any process that produces outcomes which are not predictable in advance. The toss of a single six-sided die is a simple example. We know that one of the six numbers between one and six will occur on any given toss, but prior to a toss we cannot state with certainty which outcome will be observed. A list of all the possible outcomes of an experiment is called the sample space. We may be interested in the individual outcomes themselves or in sets of outcomes, i.e. we will be interested in subsets of the sample space, and we will refer to such subsets as events. In tossing a single six-sided die once, the sample space is S = {1, 2, 3, 4, 5, 6}. We may be interested in the event of observing an even number. This is the subset {2, 4, 6}. If any of the individual outcomes in this event occurs then the event itself is said to have occurred. This is true in general. For example, if we toss the die and observe a 3, then the events {1, 3, 5} and {1, 2, 3} have both occurred, as has any other event containing 3 as an outcome. For any sample space the empty or null set is considered to be an event, along with the entire sample space itself. The union of two events, A ∪ B, is the event that occurs if either A or B or both occur. The intersection of two events, A ∩ B, is the event that occurs if both A and B occur. The complement of any event A, denoted Ac, is the event comprising all outcomes that are not in A. A collection of events A1, A2, ..., An is said to be mutually exclusive (or disjoint) if no pair has any outcomes in common. The probability of an event A, denoted P(A), is a quantitative measure of how likely we are to observe A in a single trial of an experiment. Initially, we define or interpret probability as is typically done in introductory statistics textbooks: the probability of an event is the long-run relative frequency with which the event occurs over very many independent trials of the experiment. This definition is controversial, in part because it does not always make sense (What is the probability the Denver Broncos will win the Super Bowl in 2010?). For further discussion and additional interpretations of probability see [Hájek, 2007]. We note that the mathematical rules and properties of probability described below do not depend on the specific interpretation of probability.
The mathematical development of probability starts with three basic rules or axioms:

1. For any event A, 0 ≤ P(A) ≤ 1. If A has probability 0 then it is impossible, and if its probability is 1 then it is certain to occur. Any event of practical interest will never have probability 0 or 1.

2. Denoting the sample space by S, P(S) = 1 (something has to happen). S could be considered a tautology.

3. For a sequence of mutually exclusive events A1, A2, A3, ...,

   P(A1 ∪ A2 ∪ A3 ∪ ···) = Σ_{i=1}^{∞} P(Ai).

All other rules and properties of probability can be derived from these three axioms. Two simple rules that are usually established quickly are that the probability of the empty set ∅ is 0, and the so-called Complement Rule: for any event A, P(Ac) = 1 − P(A). Another rule that can be derived quickly is a modification of axiom 3: for a finite sequence of mutually exclusive events A1, A2, ..., An,

   P(A1 ∪ A2 ∪ ··· ∪ An) = Σ_{i=1}^{n} P(Ai).
Most events are not mutually exclusive. It can be shown that for any two events A and B,

   P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Intuitively this makes sense. The probability that event A or event B (or both) occur is equal to the probability that A occurs plus the probability that B occurs, minus the probability that both occur (which has been double counted by the addition of the probabilities of A and B). Note that if A and B are mutually exclusive then their intersection is the empty set with probability 0, and we are back to the addition rule for mutually exclusive events given above. This rule can be extended to more than two events: for any three events A, B, and C,

   P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
In any given experiment we need to be able to assign probability to outcomes in some way. There are potentially many different ways of doing so as long as they do
not violate the axioms and other rules of probability. We have seen, for example, that the sample space for the experiment of tossing a six-sided die once is S = {1, 2, 3, 4, 5, 6}. We note that the outcomes are six mutually exclusive events, and so P(S) = P(1) + P(2) + ··· + P(6) = 1. As long as the probabilities we assign to the individual outcomes are all between 0 and 1 and sum to 1, we will have a mathematically valid probability model for the experiment. There are infinitely many such possibilities. An empirical method of assigning probabilities, based on our frequentist interpretation of probability, is to toss the die a large number of times and record the outcomes. The assignment of probabilities would be based on the proportion of times each outcome occurred. Or we might reason that the die is fair in that we do not expect to see any one outcome more often than any other, i.e. each outcome is equally likely and taken together the probabilities must sum to 1. Thus, each of the outcomes would be assigned a probability of 1/6.
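As an illustration, the empirical method just described can be carried out in a few lines of Python. This is a minimal sketch of ours, not part of the original discussion; the number of simulated tosses is an arbitrary choice.

    import random
    from collections import Counter

    # Simulate many tosses of a fair six-sided die and assign each face
    # its long-run relative frequency as an empirical probability.
    n_tosses = 100_000  # arbitrary, illustrative number of trials
    counts = Counter(random.randint(1, 6) for _ in range(n_tosses))

    for face in range(1, 7):
        print(face, counts[face] / n_tosses)  # each close to 1/6, about 0.167

With this many trials each relative frequency typically lands within a few thousandths of 1/6, illustrating why the fair-die assignment is a reasonable model.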
2.2 Conditional probability and marginal probability
The sample space of an experiment contains all the possible outcomes. Sometimes additional information becomes available that constrains the set of possible outcomes. The probability of an event is then computed based on the outcomes in the constrained sample space, and such a probability is called a conditional probability. Before giving a mathematical definition we look at a simple example. A geneticist studying two genes has access to a population of 1000 individuals. The table below shows a categorization of the individuals with respect to the genes.

                         Gene 1 Dominant    Gene 1 Recessive
    Gene 2 Dominant            560                140
    Gene 2 Recessive           240                 60
The experiment will consist of drawing an individual at random from the population. By drawing “at random” we mean that all individuals have the same chance of being drawn, 1/1000 based on the information given in the table. If the person drawn has two dominant genes we label her D1D2. If the person has a dominant Gene 1 and a recessive Gene 2 we label her D1R2 and so on. The sample space is S = {D1D2, D1R2, R1D2, R1R2} . Based on what we expect to see over the long run a reasonable assignment of probabilities is P (D1D2) = 0.56, P (D1R2) = 0.24, P (R1D2) = 0.14, P (R1R2) = 0.06.
This assignment was made by simply counting the number of individuals in each category and dividing by the total population size. Each of the numbers is between 0 and 1 and they sum to 1, so the assignment is mathematically valid. Suppose we are interested in the probability that the person selected has a dominant Gene 1. We can directly apply the addition rule for mutually exclusive events and compute this as

   P(D1) = P(D1D2 ∪ D1R2) = P(D1D2) + P(D1R2) = 0.56 + 0.24 = 0.80.

In this simple case we can also note that 800 of the 1000 individuals have a dominant Gene 1 and conclude that the probability is 800/1000 = 0.80. This is the unconditional (or marginal) probability of selecting an individual with a dominant Gene 1. Now suppose we are told that we will be selecting our individual from the group with a recessive Gene 1. We now know that we are only dealing with the subpopulation of 200 individuals with such a gene. The sample space is now

   S = {R1D2, R1R2}.

Of the 200 individuals in this subpopulation, 140 are R1D2 and 60 are R1R2, and so the probability of a dominant Gene 2 is 140/200 = 0.70. We say that the conditional probability of a dominant Gene 2 given a recessive Gene 1 is 0.70. Mathematically we write P(D2|R1) = 140/200 = 0.70. In general, for any two events A and B with P(B) ≠ 0, we define the conditional probability of A given B to be

   P(A|B) = P(A ∩ B) / P(B).

Considering the genetics example we see that

   P(D2|R1) = P(D2 ∩ R1) / P(R1) = (140/1000) / (200/1000) = 140/200 = 0.70.
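As a check on the arithmetic, these marginal and conditional probabilities can be computed directly from the table counts. The following Python sketch is illustrative; the dictionary encoding of the table is our own choice.

    from fractions import Fraction

    # Counts from the genetics table; keys are (Gene 1 status, Gene 2 status).
    counts = {("D1", "D2"): 560, ("D1", "R2"): 240,
              ("R1", "D2"): 140, ("R1", "R2"): 60}
    total = sum(counts.values())  # 1000

    # Marginal probability of a dominant Gene 1.
    p_d1 = Fraction(counts[("D1", "D2")] + counts[("D1", "R2")], total)
    print(p_d1)  # 4/5, i.e. 0.80

    # Conditional probability of a dominant Gene 2 given a recessive Gene 1:
    # P(D2|R1) = P(D2 and R1) / P(R1).
    p_r1 = Fraction(counts[("R1", "D2")] + counts[("R1", "R2")], total)
    print(Fraction(counts[("R1", "D2")], total) / p_r1)  # 7/10, i.e. 0.70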
Failure to recognize the appropriate restriction of the sample space leads to common errors in calculating probability. Suppose we are told that a family has two children and that at least one is a girl. What is the probability that both are girls? The most common intuitive answer to this question, assuming boys and girls occur with equal probability, is 0.5. This answer is wrong. Let GG denote the outcome of a girl born first and a girl born second, GB denote the outcome of a girl born first and a boy born second, and so on. The sample space is S = {GG, GB, BG, BB}
and, assuming equal chances for a boy or a girl, a reasonable assignment of probabilities gives each outcome probability 0.25, i.e. each is equally likely. We are being asked to find the conditional probability of GG given, or conditional on, the information that at least one of the children is a girl. Denoting this event by B, we have B = {GG, GB, BG}. The probability that GG occurs is 0.25 and the probability that B occurs is 0.75. Note also that (GG ∩ B) = GG. Thus,

   P(GG|B) = P(GG ∩ B) / P(B) = P(GG) / P(B) = 0.25 / 0.75 = 1/3.
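This enumeration is easy to mechanize. A minimal Python sketch, using exact fractions, reproduces the answer of 1/3 and also anticipates the variant considered next, in which the oldest child is known to be a girl.

    from fractions import Fraction

    # The four equally likely birth orders for two children.
    sample_space = ["GG", "GB", "BG", "BB"]

    def conditional_prob_of_gg(condition):
        # Restrict the sample space to the conditioning event, then count.
        restricted = [o for o in sample_space if condition(o)]
        return Fraction(restricted.count("GG"), len(restricted))

    print(conditional_prob_of_gg(lambda o: "G" in o))     # 1/3: at least one girl
    print(conditional_prob_of_gg(lambda o: o[0] == "G"))  # 1/2: oldest is a girl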
Suppose we had phrased the question as: what is the probability of two girls given that the oldest is a girl? Denoting the latter event by B, we have B = {GG, GB}, and

   P(GG|B) = P(GG ∩ B) / P(B) = P(GG) / P(B) = 0.25 / 0.50 = 1/2.

The definition of conditional probability leads immediately to the so-called Multiplication Rule:

   P(A ∩ B) = P(A|B)P(B).

The Law of Total Probability is an important result. Let A1, ..., Ak be a sequence of mutually exclusive and exhaustive sets. By exhaustive we mean that the union of these sets is the sample space. For any other event B, the unconditional probability of B is

   P(B) = P(B|A1)P(A1) + ··· + P(B|Ak)P(Ak).

This rule can be used to prove Bayes' Rule (or Theorem): let A1, ..., Ak be a sequence of mutually exclusive and exhaustive sets, and let B be an event such that P(B) ≠ 0. Then

   P(Aj|B) = P(Aj ∩ B) / P(B) = P(B|Aj)P(Aj) / Σ_{i=1}^{k} P(B|Ai)P(Ai).

The middle term is just an application of the definition of conditional probability. The numerator of the last term follows from the Multiplication Rule, and the denominator is the Law of Total Probability. Bayes' Rule provides insight into another type of problem for which our intuition leads to incorrect probability calculations. Suppose a test exists for a rare (only 1 in 1000 adults has the disease) but serious disease, and that the test is quite accurate: when the disease is present it will return a positive result 99% of the time, and when the disease is absent a positive result will occur only 2% of the time. You are given the test and receive a positive result. How worried should you be? Your first inclination might be to focus on how likely the test is to return a positive result when the disease is present, and many people would choose that figure (0.99) as the probability of having the disease. But the event of interest is not observing a positive result given the disease but the event of having the disease
given a positive test result. Let D denote the event of having the disease and + denote the event of testing positive. Based on the above information we know the following: P(D) = 0.001, P(+|D) = 0.99, and P(+|Dc) = 0.02. By Bayes' Rule,
   P(D|+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|Dc)P(Dc)]
          = 0.99(0.001) / [0.99(0.001) + 0.02(0.999)]
          = 0.00099 / 0.02097 = 0.047.
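The same computation can be expressed as a short Python sketch (the variable names are ours):

    # Bayes' Rule for the disease-testing example.
    p_d = 0.001            # prevalence, P(D)
    p_pos_given_d = 0.99   # P(+|D)
    p_pos_given_dc = 0.02  # P(+|Dc)

    # The Law of Total Probability gives the denominator, P(+).
    p_pos = p_pos_given_d * p_d + p_pos_given_dc * (1 - p_d)
    print(p_pos_given_d * p_d / p_pos)  # about 0.047

So even after a positive result, the probability of disease is under 5%, because the disease is so rare that most positives come from the healthy majority.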
2.3 Probabilistic independence and generalized probability conjunction rule
Two events are said to be probabilistically independent if and only if the probability of the occurrence or non-occurrence of one event in no way affects the probability of the occurrence or non-occurrence of the other. Mathematically, we define two events A and B to be independent if P(A|B) = P(A). It can be shown that this equality also implies P(B|A) = P(B). Note that under an assumption of probabilistic independence the Multiplication Rule becomes

   P(A ∩ B) = P(A|B)P(B) = P(A)P(B).

Some textbooks define independence in this way and then show that independence implies P(A|B) = P(A). Typically in practical problems independence is an assumption that is made. It may be reasonable in some cases, such as in assuming that repeated tosses of a six-sided die yield outcomes that are independent of one another (getting a 6 on one toss does not affect the chances of getting a 6 on the next, or any other, toss). But in other situations it may not be a reasonable assumption. Subsequent to the explosion of the space shuttle Challenger in 1986, it was determined that the computed risk of catastrophic failure had been seriously underestimated due, in part, to unwarranted assumptions of independence of failure of related parts.
Mutual independence of more than two events is more complex. Mathematically it is defined as follows. A sequence of events A1, ..., An is said to be mutually independent if for every k (k = 2, 3, ..., n) and every set of indices i1, ..., ik,

   P(Ai1 ∩ ··· ∩ Aik) = P(Ai1) ··· P(Aik).

For example, three events A, B, and C are mutually independent if and only if all of the following conditions are met:

   P(A ∩ B) = P(A)P(B)
   P(A ∩ C) = P(A)P(C)
   P(B ∩ C) = P(B)P(C)
   P(A ∩ B ∩ C) = P(A)P(B)P(C)

Pairwise independence does not imply mutual independence. Consider tossing two coins, a quarter and a dime. Let H1 denote the event of observing a heads on the quarter, T2 denote the event of observing a tails on the dime, and C denote the event of observing either two heads or two tails. We assume the tosses are independent of one another. The sample space of the experiment consists of four equally likely outcomes:

   S = {H1H2, H1T2, T1H2, T1T2}.

We have
   P(H1 ∩ T2) = P(H1T2) = 1/4 = P(H1)P(T2)
   P(H1 ∩ C) = P(H1H2) = 1/4 = P(H1)P(C)
   P(T2 ∩ C) = P(T1T2) = 1/4 = P(T2)P(C)
   P(H1 ∩ T2 ∩ C) = P(∅) = 0 ≠ P(H1)P(T2)P(C)
Thus, although the events are pairwise independent they are not mutually independent.
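A small Python sketch can verify all four conditions by enumerating the equally likely outcomes; the set-based encoding here is an illustrative choice of ours.

    from fractions import Fraction

    # The four equally likely outcomes; first letter is the quarter,
    # second letter is the dime.
    space = ["HH", "HT", "TH", "TT"]

    def prob(event):
        return Fraction(len(event), len(space))

    H1 = {o for o in space if o[0] == "H"}  # heads on the quarter
    T2 = {o for o in space if o[1] == "T"}  # tails on the dime
    C = {"HH", "TT"}                        # two heads or two tails

    print(prob(H1 & T2) == prob(H1) * prob(T2))                # True
    print(prob(H1 & C) == prob(H1) * prob(C))                  # True
    print(prob(T2 & C) == prob(T2) * prob(C))                  # True
    print(prob(H1 & T2 & C) == prob(H1) * prob(T2) * prob(C))  # False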
2.4 Probabilistic/logical independence and mutual exclusiveness of propositions
Two events may be logically independent without being probabilistically independent. Let A be the event that a randomly chosen individual drinks coffee and B be the event that a randomly chosen individual smokes. These two events are logically independent of one another because one does not imply the other, nor does one imply the negation of the other. However, social research has shown that smokers are more likely to drink coffee than non-smokers, and thus these two events are probabilistically dependent. Students in introductory probability courses frequently struggle with the relationship between two events being mutually exclusive of one another and being probabilistically dependent/independent of one another. It seems to be a natural first inclination to consider two mutually exclusive events to be independent,
but in fact if two events are independent of one another they must overlap in some way. If two events A and B are mutually exclusive (and both have non-zero probability of occurrence), then knowing one has occurred automatically rules the other out, because P(A|B) = 0 ≠ P(A). For example, if we let H denote getting a heads on a single toss of a fair coin and T denote getting a tails, then these two events are clearly mutually exclusive and dependent; P(H|T) = 0 ≠ P(H). Formally, we say that mutual exclusivity implies dependence, but the converse is not true. Thus, dependent events may or may not be mutually exclusive. The contrapositive (independence implies two events are not mutually exclusive) is, of course, also true. For an example of two overlapping independent events, consider the experiment of tossing a fair coin twice. Let H1 denote the event of getting a heads on the first toss and let B denote the event of two heads or two tails. These two events overlap, but P(H1|B) = 1/2 = P(H1).
3 FROM PROBABILITY TO STATISTICS AND A ROAD-MAP FOR THE REST OF THE CHAPTER
3.1 The fundamental difference between probability and statistics
One way to appreciate the difference between statistics and probability is to see how these two disciplines make inference based on data/samples. By "inference" we mean the procedure by which one goes from known data to the unknown value of a parameter we are interested in knowing/estimating. The style of inference applied in statistics goes from a sample to the population. It is the paradigm example of uncertain inference. The uncertainty of statistical inference stems from the fact that the inference is made from a known sample to an unknown population. In contrast, there are other kinds of inference where we make inferences about unknown samples based on our information about the population. The latter kind of inference is known as non-statistical inference, or inference involving probability. In short, drawing an inference from a sample to a population is statistics, and drawing an inference from a population to a sample is mathematics that rests on the theory of probability. Consider an urn containing a large number of balls of the same size, but of different colors. Suppose we know that 25% of the balls are red. We are going to draw a sample of 100 balls randomly, with replacement, from the urn. What is the probability that the proportion of red balls in the sample will lie between 0.20 and 0.30? Using the laws of probability we can answer this question exactly. Suppose
we know nothing about the proportion of colors in the urn and we sample balls as described above. We wish to use the information in the sample to estimate the proportion of red balls in the population. What is the best way of doing this? If the observed proportion is 0.25 what conclusions can be drawn about the true but unknown proportion of red balls in the population? Is the observed proportion in the sample consistent with a true proportion of say 0.45, i.e. how likely are the observed data if the true proportion is 0.45? These are the types of questions of interest to statisticians (and to scientists using data to draw inferences to populations or processes from which the data were drawn).
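To make the contrast concrete, the first (probability) question can indeed be answered exactly: the number of red balls in 100 draws with replacement is binomially distributed, so the probability that the sample proportion lies between 0.20 and 0.30 is a finite sum. A minimal Python sketch of ours:

    from math import comb

    # Number of red balls in 100 draws with replacement is Binomial(100, 0.25),
    # so sum the binomial probabilities for 20 to 30 red balls.
    n, p = 100, 0.25
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(20, 31))
    print(prob)  # about 0.80

The statistical questions in the paragraph above have no such closed-form answer; they require a theory of estimation, which is developed later in the chapter.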
3.2 Five types of questions in statistics
Although we have already touched on some of the central issues of statistics above, we will elaborate how one could understand the enterprise of statistics in terms of five types of questions. Since data/samples are at the heart of statistics, and statistics is often regarded as the science of data analysis, our questions begin with the data/sample. The five questions are as follows:

1. How could one describe the sample effectively?

2. From the sample, how could one make an inference about the total population?

3. How reliable will the conclusion of our inference be?

4. (a) Is there a relation between two or more variables? (b) If so, what could we say about the relation? Is it a simple association, or is there a causal relation between them?

5. How should one collect the sample so that it will help to produce the most reliable estimate?

For the sake of discussion, we outline how different sections revolve around one or two of these questions, although in this short review we won't be able to cover all key questions about statistics. Question (1) pertains to what we ordinarily mean by "statistics", which we take to be a collection of data. We want to know what those data tell us about a single variable representative of some object of interest. Section 4, among other topics, responds to (1). Section 6, which discusses two distinct ways of doing statistical inference, estimation and hypothesis testing, is devoted to addressing (2). In most colleges and universities, these two approaches to statistical inference are an essential part of learning statistics. To be able to do statistical inference we need to develop probability models. In one sense, a probability model provides the tools connecting data to how one should do inference in a systematic fashion. Section 5 discusses additional probability results needed to make reliable inferences based on the sample and a probability model. The ideas behind sampling distributions and the central limit theorem
will be expanded in section 5. Section 6 will address (3). Statistics is an ever-growing discipline with multi-layered complexity. We cannot address all aspects of this complexity in this short section. As a result, we won't address (5) in detail, although it is important.

4 DATA REPRESENTED AND DESCRIBED
Data/samples are the brick and mortar of statistics, but a data set does not have to be particularly large before a simple listing of data values becomes overwhelming and effective ways of summarizing the information in a data set are needed. The first chapter or two in most introductory textbooks deal with graphical and numerical methods of summarizing data. First, we discuss how the bridge between the observations and data we have and the world they represent can be built in terms of unambiguous notions, like objects, variables, and scales, which constitute some basics of mathematical statistics. Second, we introduce three measures of central tendency and point out their implications for critical reasoning. Third, we discuss the variance and the standard deviation, the two key measures of the dispersion of data.
4.1 Understanding data in terms of objects, variables, and scales
In deductive logic, we attribute truth-values to propositions. In probability theory, we attribute probability values both to events and to propositions. In statistics, data, which stand for our observations about the world, lie at the core. In order for data to be converted into a language free from ambiguity, so that they can furnish us with reliable information about the world, we take recourse to the language of mathematical statistics. The discussion of data in most introductory statistics textbooks typically starts with the definition of a population as a collection of objects of interest to an investigator. The investigator wishes to learn something about selected properties of the population. Such properties are determined by the characteristics of the individuals who make up the population, and these characteristics are referred to as variables because their values vary over the individuals in the population. These characteristics can be measured on selected members of the population. If an investigator has access to all members of a population then he has conducted a census. A census is rarely possible, and an investigator will instead select a subset of the population called a sample. Obviously, the sample must be representative of the population if it is to be used to draw inferences to the population from which it was drawn. An important concept in statistics is the idea of a data distribution, which is a list of the values and the number of times (frequency) or proportion of the time (relative frequency) those values occur. Variables can be classified into four basic types: nominal, ordinal, interval, and ratio. Nominal and ordinal variables are described as qualitative, while interval
and ratio scale variables are quantitative. Nominal variables differ in kind only. For example, political party identification is a nominal variable whose "values" are labels, e.g. Democrat, Republican, Green Party. These values do not differ in any quantitative sense. This remains true even if we represent Democrats by 1, Republicans by 2, and so on. The numbers remain just labels identifying group membership, without implying that 1 is superior to 2. That this scaling is not amenable to quantification does not mean that it has no value. In fact, it helps us to summarize a large amount of information into a relatively small set of non-overlapping groups of individuals who share a common characteristic. Sometimes the values of a qualitative variable can be placed in a rank order. Consider, for example, the quality of toys received in different overseas cargoes. Each toy in a batch receives a quality rating (Low, Medium, or High). The ratings could also be given numerical codes (e.g. 1 for high quality, 2 for medium quality, and 3 for low quality). This ordinal ranking implies a hierarchy of quality in a batch of toys received from overseas. The ranking must satisfy the law of transitivity, implying that if 1 is better than 2 and 2 is better than 3, then 1 must be better than 3. Since both nominal and ordinal scales are designated as qualitative, they are regarded as non-metric scales. Interval scale variables are quantitative variables with an arbitrarily defined zero value. Put another way, a value of 0 does not mean the absence of whatever is being measured. Temperature measured in degrees Celsius is an interval scale variable. This is a metric scale in which, for example, the difference between 2 and 5 is the same as the difference between 48 and 51. In contrast to interval scale data, in ratio scale data zero actually signifies "nothing" scored on the scale, just as zero on a speedometer signifies no movement of a car. Temperature measured in degrees Kelvin is a ratio scale variable because a value of 0 implies the absence of all motion at the atomic level. Mathematical operations make sense with quantitative data, whereas this is not true in general of qualitative data. This should not be taken to mean that qualitative data cannot be analyzed using quantitative methods, however. For example, gender is a qualitative variable and it makes no sense to talk about the "average" gender in a population, but it makes a lot of sense to talk about the proportions of men and women in a population of interest.
4.2 Measures of central tendency
The data distribution can be presented graphically (e.g. in a histogram) or tabularly. Graphical analyses are an important and often overlooked aspect of data analysis, in part because they seem so simple. We do not go into much detail here because we are more interested in statistical inference, but we emphasize the importance of graphical summaries as part of an initial data analysis. Graphical summaries can provide a quick overall impression of the data, but we need more. One important property of a sample (and by extension of the
population from which it was drawn) is the location of its "center". Suppose we have a sample of size n from some population of interest. We denote the values of the observations in the data set by x1, x2, ..., xn. There are at least three distinct measures of central tendency: (i) the mode, (ii) the mean, and (iii) the median. The mode is the most frequent value in the data. The mean is computed by adding up all the numbers and dividing by the size of the sample. We denote the mean by

   x̄ = (1/n) Σ_{i=1}^{n} xi.

The median is the middle value in the sense that half the values lie at or above the median and half lie at or below it. Computation of the median starts with ordering the data, generally from lowest to highest. If the sample size is odd, the median is the middle value. If the sample size is even, the median is the mean of the two middle values. By way of example, consider the number of home runs hit by Babe Ruth during his 15 years with the New York Yankees (1920 to 1934):

   54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22.

Forty-six home runs appears most often (3 times), and this is the mode. The mean number of home runs hit by Ruth is

   x̄ = (1/15)(54 + 59 + ··· + 22) = 659/15 = 43.9.

We compute the median by first ordering the data from low to high:

   22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60.

Since n is odd, we choose the middle (eighth) value, which is 46 (the same as the mode). The question of which measure is most appropriate arises immediately. The answer depends on context. The mode is generally easy to identify if one has already carried out a graphical assessment and may be perfectly adequate in some cases. However, it is rarely used in practical statistics anymore. Indeed, it is rare to see the mode given in research papers in the biological and ecological sciences, and we do not discuss it further here. The mean and median are commonly used as numerical summary statistics, however. Often both are provided. The mean has the advantage that it uses all the data; however, it is not resistant to unusual values in the data set. A common example used by one of the authors when he teaches introductory statistics is to imagine the effect on the mean annual income of the people in the classroom if Bill Gates were to walk in the door. The median is resistant to unusual or atypical observations, but it depends directly on only one or two data points and thus ignores a lot of information in the data set.
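For readers following along computationally, Python's standard statistics module reproduces these three summaries directly:

    from statistics import mean, median, mode

    # Babe Ruth's home run totals with the Yankees, 1920-1934.
    hr = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
    print(mean(hr))    # 43.93...
    print(median(hr))  # 46
    print(mode(hr))    # 46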
We can illustrate these points by considering the graphical representations of idealized data distributions below. The graphs are histograms with smooth curves added. The data in Fig. 1 are skewed to the right (they have a long right tail), the distribution in Fig. 2 is skewed to the left, and the distribution in Fig. 3 is symmetric.
Figure 1. Right skewed
Figure 2. Left skewed

In each figure the median is the value that has half the area under the curve above it and half the area below it. The mean of the data set depicted in Fig. 1 will be greater than the median because the values in the long right tail "pull" the mean in that direction. Similarly, the mean in Fig. 2 will be less than the median. The mean and median will be equal to one another in Fig. 3. The implication is that the median is a better measure of central tendency (in the sense of identifying the most typical value) when data distributions are skewed, and that it is comparable to the mean when the data distribution is symmetric. This would seem to imply that the median is superior to the mean, and this argument may well be valid for descriptive data analysis. But the mean is more important for inferential analysis, as will be discussed below. Note that the mode would generally not be interpretable as the most typical value when data are skewed. The mode, median, and mean are all equivalent for descriptive purposes when the data distribution is symmetric. However, the mean is more commonly used. Which measure to use depends in large part on how it will be used.

Figure 3. Symmetric
Generally, an investigator is interested in drawing an inference from a sample to a population. An investigator may be interested in estimating an unknown population mean, defined (for a finite population of size N) to be

   μ = (1/N) Σ_{j=1}^{N} xj.
The sample mean x̄ would seem to be the natural estimator to use in this case. However, if the population distribution is symmetric and the sample is drawn in such a way that it is representative of the population (in which case the sample distribution is approximately symmetric), the population mean could be estimated by any of the central tendency measures described above. Which is the best? Answering this question is one reason why probability theory is so important to the theory of statistics, and we will return to this topic below (see section 6). Strictly speaking, these measures of center are only appropriate for interval or ratio scale data. They are sometimes applied to ordinal data under the (strong) assumption that the rankings can be treated as numerical. Nominal data are sometimes summarized by providing the proportion of individuals in each category. Proportions can be viewed as means. For example, if we have n observations of people in a population classified as male or female, we could give a value of 1 to males and a value of 0 to females. We would have a list of n 0s and 1s, and the sum of those divided by the sample size would be the proportion of males in the sample.
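One way to make the "which is best" question precise is the thought experiment of repeated sampling, developed in section 4.4: draw many samples from a symmetric population and see which estimator varies less around the population mean. A rough illustrative simulation follows; the Normal(100, 15) population and the sample sizes are arbitrary choices of ours.

    import random
    from statistics import mean, median, stdev

    random.seed(1)  # for reproducibility of the illustration
    means, medians = [], []
    for _ in range(2000):
        # One sample of size 25 from a symmetric Normal(100, 15) population.
        sample = [random.gauss(100, 15) for _ in range(25)]
        means.append(mean(sample))
        medians.append(median(sample))

    # The sample mean clusters more tightly around 100 than the median does.
    print(stdev(means), stdev(medians))

For normal populations the sample median varies roughly 25% more than the sample mean, which is one reason the mean is preferred for inference in that setting.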
4.3 Measures of Dispersion
There is an old joke in statistics about a statistician who had his head in an oven and his feet in a freezer. When asked how he felt he replied, “On average I am quite comfortable”. How the values in a data set are spread out around the center is an important part of data description. Such variation is commonplace (indeed it is the rationale for the definition of the term variable) and in many cases it may be as interesting if not more interesting than the center.
The appropriate measure of dispersion to use is determined in part by the measure of center used. When the mean is chosen, interest centers on quantifying the deviations of the observations about the mean. The sample variance is typically defined to be

   s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².

This is a kind of average squared deviation about the mean. The units of measurement of the sample variance are awkward (if the variable is the cost of a home then the units of the variance are squared dollars) and, especially for descriptive purposes, the positive square root of the variance is often given. This is referred to as the standard deviation. Both the variance and standard deviation are nonnegative and will equal 0 if and only if all the values in a data set are the same. This will never, or at least very rarely, happen in practice. The sample variance and standard deviation are both sensitive to atypical observations because of their dependence on the sample mean. Why do we divide by n − 1? Not every text does, and if the goal is simply to describe variability in a sample then either divisor can be used. However, the ultimate goal is to use the sample numerical summary statistics to estimate their unknown population counterparts. Consider a finite population of size N. The population variance is defined to be

   σ² = (1/N) Σ_{j=1}^{N} (xj − μ)².

If we are to use the sample variance to estimate the population variance, and if we knew the population mean μ, then dividing by n in the formula for the sample variance would be appropriate. But as a general rule we will not know the population mean and will estimate it using the sample mean. The variability of the n observations in the sample about the sample mean will tend to be less than their variability about the population mean. Thus, dividing by the sample size will tend to underestimate the population variance. Dividing by n − 1 provides the necessary correction to this underestimation. We will return to this topic below when we discuss bias in statistical estimators. As an example, consider a simple data set: 10, 12, 14, and 16. The table below summarizes computation of the sample variance and standard deviation.

   Observation    Value    (xi − x̄)         (xi − x̄)²
   1              10       10 − 13 = −3      9
   2              12       12 − 13 = −1      1
   3              14       14 − 13 = 1       1
   4              16       16 − 13 = 3       9
   Total          52       0                 20
The sample mean is 52/4 = 13. The sample variance is 20/3 = 6.67 and the sample standard deviation is the square root of 6.67, or 2.58. Note that the sum of the deviations in column 3 is 0. This is true in general and is the reason why we cannot just "average" the deviations from the mean. The variance and standard deviation are computed using the mean and are thus naturally paired with the mean. It is not correct, for example, to summarize location using the median and variability using the standard deviation. If the median is chosen as the measure of center, another measure of variability is needed. One such measure is based on the sum of the absolute deviations from the median. Another, cruder measure is the interquartile range, which is the difference between the third quartile (75th percentile) and the first quartile (25th percentile). We will not discuss such measures further here. For most distributions the bulk of the possible values will lie within a couple of standard deviations of the mean. For symmetric distributions like the idealized example in Fig. 4, about 68% of the values will lie within one standard deviation of the mean, about 95% will lie within two standard deviations of the mean, and over 99% will lie within three standard deviations of the mean. These figures are roughly accurate for skewed distributions as long as they are not badly skewed. The figures are exact for so-called normal distributions (also referred to as bell-shaped or Gaussian distributions). We will discuss this more below.
Figure 4. A graphical representation of the values in an idealized symmetric population with mean 0 and standard deviation 1.
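The variance computations above are easy to check with Python's statistics module, whose variance and stdev functions use the n − 1 divisor discussed above:

    from statistics import mean, stdev, variance

    data = [10, 12, 14, 16]
    print(mean(data))      # 13
    print(variance(data))  # 6.67 (= 20/3); divides by n - 1
    print(stdev(data))     # 2.58
    print(sum(x - mean(data) for x in data))  # deviations sum to 0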
4.4 From descriptive statistics to inference
The material presented in the preceding section is typically referred to as descriptive statistics and no probability is required to understand and to use the methods described above. However, at points in the discussion we alluded to the use of descriptive summary statistics as estimators of unknown population quantities,
e.g. using the sample mean to estimate a population mean. The sample mean is based on values observed in a subset of the population and will not in general be equal to the population mean. If we observe a sample mean of, say, 400, what can we say about the population mean? Is it exactly equal to 400? This is not very likely. Could the population mean be 425 or 350 or 675? In determining how to quantify the uncertainty in the observed value of a statistic such as the sample mean, we ask ourselves the question, "What would we expect to see if the basic sampling process was repeated over and over again, independently, and under the same conditions?" That is, we conduct the thought experiment of repeating our sampling plan many times, computing the value of the statistic of interest each time, and looking at the resulting distribution of those values. We are interested in the long-run behavior of the statistic. Obviously we cannot do this in practice. To carry out this thought experiment we must make an assumption about the distribution of the values in the population and about how we draw values from the population. For the latter we must assume that we have a sample that is representative of the population in which we have an interest. Formally, we say that we have a simple random sample from the population, by which we mean that all possible samples of the same size have the same probability of being selected. Once we have a reasonable probability model for the process that generated our data, we can use the results of probability theory to investigate the probabilistic behavior of sample statistics of interest. In the next section, we continue our discussion of probability theory with the goal of doing just that.
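Although the thought experiment cannot be carried out on a real population, it can be mimicked on a computer. The following illustrative Python sketch (the digit-valued population is an arbitrary choice of ours) repeats a sampling plan many times and examines the resulting distribution of sample means:

    import random
    from statistics import mean, stdev

    # Population: a single digit 0-9, each equally likely (an arbitrary choice).
    random.seed(2)
    sample_means = [mean(random.choice(range(10)) for _ in range(50))
                    for _ in range(5000)]

    print(mean(sample_means))   # close to the population mean of 4.5
    print(stdev(sample_means))  # close to sigma/sqrt(50), about 0.41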
5 RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
There will be many experimental situations in which interest is primarily focused on numerical characteristics of various outcomes. We might be interested in questions such as "What is the probability of observing 6 or more heads in 10 tosses of a fair coin?" or, perhaps more practically, "What is the probability of surviving five years after being diagnosed with cancer?" We will need to be able to associate each outcome with a number. The rule which determines the association is called a random variable. In some ways this is an unfortunate terminology. Strictly speaking, a random variable is neither random nor a variable; it is a mathematical function mapping outcomes in a sample space to the real line. Typically, however, introductory statistics textbooks steer clear of this technical definition and motivate the name as, for example, Devore [2009, p. 87] does when he writes that a random variable is "a variable because different numerical values are possible and random because the observed value depends on which of the possible experimental outcomes results". We will need to assign probability to the values a random variable can assume. This assignment is done by means of a probability model or probability distribution. We will divide this section into three short subsections. First, we will introduce the notion of a random variable, the expectation of a random variable, and some basics of probability distributions. Once we become acquainted with random
variables, we will discuss two popular probability distributions: (i) the normal distribution and (ii) the binomial distribution. We spend some time on the nature of the normal distribution and how to convert it to its standard normal form.
5.1 Random variables and the basics of probability distributions
In simple language, a random variable is a rule for assigning a number to the outcome of a random process. By convention random variables are denoted by upper case letters from the Roman alphabet (X, Y, W, etc.). Realizations (i.e. observed values) of a random variable are denoted by the corresponding lower case letters (x, y, w, etc.). Consider tossing a fair coin three times and recording the outcomes in terms of heads or tails. The sample space is comprised of eight equally likely outcomes

S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}.

Let X = the number of heads observed on any single trial of this experiment. X is a random variable. It can take on one of four values: 0, 1, 2, or 3. The probability that X takes on these values is determined by the probabilities associated with the outcomes in the sample space. For example, X = 2 if we observe any one of three possible outcomes HHT, HTH, or THH. Thus, the probability X = 2 is

P(X = 2) = P(HHT ∪ HTH ∪ THH) = 3/8

by the addition rule for mutually exclusive events. We summarize all the possibilities in the probability distribution for X:

x        0     1     2     3
pr(x)    1/8   3/8   3/8   1/8
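To make the construction concrete, the following minimal Python sketch (ours, purely illustrative; the language choice is an assumption) reproduces the table by brute-force enumeration of the sample space.

from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))          # the 8 equally likely outcomes
pr = {}
for o in outcomes:
    x = o.count("H")                              # X(o) = number of heads
    pr[x] = pr.get(x, Fraction(0)) + Fraction(1, len(outcomes))
print(pr)  # {3: Fraction(1, 8), 2: Fraction(3, 8), 1: Fraction(3, 8), 0: Fraction(1, 8)}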
Note that all values of pr(x) are between 0 and 1 and sum to 1, so this is a valid assignment of probabilities to the values of the random variable. The probability distribution pr(x) of the random variable X = the number of heads observed in three tosses of a fair coin describes how probability is assigned to the possible values X can assume. Random variables can be discrete or continuous. A discrete random variable takes on a finite or countably infinite number of values. The random variable X = the number of heads in three tosses of a fair coin is discrete and finite. As an example of an infinite discrete random variable, consider the experiment of tossing a coin until a head is observed. Let X be the number of failures recorded until the first head occurs. Of course, in practice we would never toss the coin forever but there is no well-defined upper limit and it may make sense to model the outcomes of this experiment by assigning X the possible values 0, 1, 2, 3, . . . . Most commonly used discrete
random variables are integer valued but they do not need to be. Continuous random variables can take on uncountably many values in an interval on the real line. By uncountable we mean that the values of the random variable cannot be placed in a one-to-one correspondence with the natural numbers. Consider the random variable X = the length of a randomly selected phone call. Although one could measure the call to the nearest minute or second, the measurement could always, in principle, be refined to ever shorter intervals. As a result, it is better to regard it as a continuous random variable capable of taking on any value on the positive real line. In this chapter, we will mostly confine ourselves to discrete random variables. Each random variable is associated with a probability distribution (function), which must satisfy the rules of probability theory. We saw a simple example above. In general a probability distribution for a discrete random variable (sometimes also called a probability mass function) is defined as follows. A discrete random variable X takes a finite or countably infinite number of values. The values it can assume are determined by the outcomes in the associated sample space, i.e. for every outcome o in the sample space there is an associated value for X. Mathematically then X is a function mapping outcomes o to numbers x, X(o) = x, although for convenience we usually suppress the functional notation. The probability distribution (or model) for X is then pr(x) = P(X = x) = P(o ∈ S : X(o) = x). A function pr(x) is a valid probability distribution if

1. 0 ≤ pr(x) ≤ 1 for all x
2. Σₓ pr(x) = 1, where the sum is taken over all possible values of x.
The probability P(X ∈ A) is found by summing the probabilities associated with the values x in the event A. Consider tossing a fair die once and recording the number on the upturned face, X. Clearly there are six different values this random variable can take on and each one is equally likely, i.e. P(X = 1) = · · · = P(X = 6) = 1/6. The probability that we observe a value of X greater than or equal to 3 is

P(X ≥ 3) = P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) = 2/3.

We will observe different values of a random variable over many repeated trials of its associated experiment. The expected value of a random variable is a weighted average of its possible values. Formally, for a discrete random variable X the expected value of X is defined to be

E(X) = µ_X = Σₓ x pr(x).
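A minimal sketch of the same calculations, ours and purely illustrative, sums the fair-die pmf over the event {X ≥ 3} and computes the weighted average E(X), anticipating the die example below.

from fractions import Fraction

pr = {x: Fraction(1, 6) for x in range(1, 7)}     # pmf of a fair die
p_event = sum(pr[x] for x in pr if x >= 3)        # P(X >= 3) by summing over the event
mu = sum(x * p for x, p in pr.items())            # E(X) as the weighted average
print(p_event, mu)                                # 2/3 7/2  (i.e. 3.5)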
There are several ways to think about expected values. Given a long-run relative frequency interpretation of probability the expected value can be thought of as the long-run average. That is, if we imagine repeating the experiment a large number of times and computing the average of the observed values of the random variable, that average should be close to the expected value. Another way to think about it is to imagine a “perfect” sample, i.e. a sample that contains the values we expect it to contain given the probability distribution. The average of that perfect sample is equal to the expected value. Consider a random variable X with the following probability distribution.

x        0      1      2      3
pr(x)    0.48   0.39   0.12   0.01

Now imagine the experiment is repeated 100 times so that we observe 100 values of X. Based on the probability distribution we expect to see 48 values of 0, 39 values of 1, 12 values of 2, and 1 value of 3. The mean of these 100 numbers will be

(48(0) + 39(1) + 12(2) + 1(3))/100 = 0.66,

which we can rewrite as

0.48(0) + 0.39(1) + 0.12(2) + 0.01(3) = 0.66,

which is the expected value of X. We need some way to quantify the variability in those possible values of a random variable. We define the variance of a random variable to be

V(X) = σ_X² = Σₓ pr(x)(x − E(X))².
There is a short cut formula which is easier for computational purposes when computing variances by hand:

σ_X² = Σₓ pr(x)x² − µ_X².
The variance depends on how often one expects each value of X to occur and how far away the X values are from the expected value of X. One could interpret the variance of X as the weighted average squared distance from E(X), using the probabilities of each value of X as the weights. The standard deviation σX of the random variable is the positive square root of the variance of X. Recall the example above involving tossing a six-sided die once. We let X denote the number of dots on the upturned face and noted that under an assumption that the die was fair we had the following probability distribution P (X = 1) = · · · = P (X = 6) = 1/6.
The expected value is µ_X = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5. The variance, computed using the short cut formula, is

σ_X² = (1/6)(1 + 4 + 9 + 16 + 25 + 36) − 3.5² = 2.92.
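The following short sketch, again ours and illustrative only, verifies that the defining formula and the short cut formula give the same answer for the die.

pr = {x: 1/6 for x in range(1, 7)}                           # pmf of a fair die
mu = sum(x * p for x, p in pr.items())                       # 3.5
var_def = sum(p * (x - mu) ** 2 for x, p in pr.items())      # defining formula
var_short = sum(p * x * x for x, p in pr.items()) - mu ** 2  # short cut formula
print(round(var_def, 2), round(var_short, 2), round(var_def ** 0.5, 2))
# 2.92 2.92 1.71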
The standard deviation is 1.71. Two discrete random variables X and Y are said to be independent if

P(X = x, Y = y) = P(X = x)P(Y = y) = pr(x)pr(y)

for all possible values of x and y. Essentially what this says is that two discrete random variables are independent if the joint probability mass function factors into the product of the marginal mass functions. Modifications of the above arguments are needed for continuous random variables. Whereas probability distributions for discrete random variables assign probability to single values, probability distributions for continuous random variables assign probability to intervals of real numbers. The probability distribution for a continuous random variable is called a probability density function. A mathematically valid probability density function is any function f satisfying the following properties,

1. f(x) ≥ 0 for all x
2. ∫_{−∞}^{∞} f(x) dx = 1

The probability that X takes on a value in the interval (a, b) is

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

The expected value of a continuous random variable is defined to be

µ_X = ∫_{−∞}^{∞} x f(x) dx

and the variance of a continuous random variable is defined to be

σ_X² = ∫_{−∞}^{∞} (x − µ_X)² f(x) dx.
Unlike discrete random variables, the probability that a continuous random variable will take on a single value is 0:

P(X = a) = ∫_a^a f(x) dx = 0.
Two continuous random variables X and Y are independent if their joint density factors into the product of the marginals, f(x, y) = f_X(x)f_Y(y).
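Since these defining properties are all integrals, they can be checked numerically. The sketch below is ours and uses an arbitrarily chosen exponential density f(x) = λe^(−λx) with λ = 2, approximated by a crude midpoint Riemann sum; it is an illustration under those assumptions, not a recipe.

import math

lam = 2.0                                   # rate of the assumed exponential density
f = lambda x: lam * math.exp(-lam * x)      # f(x) = lam * e**(-lam*x) for x >= 0

def integrate(g, a, b, n=100_000):
    # midpoint Riemann sum of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

print(round(integrate(f, 0, 50), 4))                      # total area, ~1.0
print(round(integrate(f, 0.5, 1.0), 4))                   # P(0.5 <= X <= 1), ~0.2326
mu = integrate(lambda x: x * f(x), 0, 50)                 # mean, ~1/lam = 0.5
var = integrate(lambda x: (x - mu) ** 2 * f(x), 0, 50)    # variance, ~1/lam**2 = 0.25
print(round(mu, 4), round(var, 4))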
There are an infinite number of both discrete and continuous probability distributions. In practice however a relative handful of families of distributions are used. We will briefly consider one discrete distribution and one continuous distribution important in the practice of statistics.
5.2 The normal distribution
The normal distribution is the most important in statistics. It is the distribution people have in mind when they refer to the “bell-shaped” curve and it is also often referred to as the Gaussian distribution after the mathematician Carl Friedrich Gauss. A continuous random variable X is said to have a normal distribution (or be normally distributed) with mean µ and variance σ² if its probability density function is

f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞.

The mean can be any real number and the variance any positive real number. We say that X is N(µ, σ²). There is a different normal distribution for each pair of mean and variance values and it is mathematically more appropriate to refer to the family of normal distributions, but this distinction is generally not explicitly made in introductory courses. The history of the normal distribution is fascinating [Stigler, 1986]. It seems to have first appeared in the work of Abraham de Moivre in the mid-18th century and Gauss found it useful for work he was doing in the late 18th and early 19th centuries. It was imbued with semi-mystical significance initially. Some were impressed by the fact that this one distribution contained the three most famous irrational numbers, e, √2, and π. Normally distributed variables were considered to be a law of nature. Although not viewed quite as reverentially today it is still important for reasons which will be discussed in more detail below.
The graph of the density function is symmetric about µ. There is a different curve for every µ and σ². In any normal distribution 68% of the data fall within σ (one sigma) of the mean µ, 95% of the data fall within 1.96σ of µ, and 99.7% of the data fall within 3σ of µ. These proportions are the same for any normally distributed population. For simplicity, we frequently convert the values from the units in which they were measured to unitless standard values. Figure 4 shows an example of the so-called standard normal distribution with mean 0 and variance equal to 1 along with an illustration of the 68-95-99.7 rule. To transform a normally distributed random variable into a standard normal random variable we subtract the mean and divide by the standard deviation. The result is typically referred to as a Z score,

Z = (X − µ)/σ.

The random variable Z has a N(0, 1) distribution. As an example, suppose the heights of American young women are approximately normally distributed with µ = 65.5 inches and σ = 2.5 inches. The standardized height

Z = (height − 65.5)/2.5

follows a standard normal distribution. A woman's standard height is the number of standard deviations by which her height differs from the mean height of all American young women. A woman who is 61 inches tall, for example, has a standard height of

Z = (61 − 65.5)/2.5 = −1.8,

or 1.8 standard deviations less than the mean height. The standard normal distribution is important in introductory classes because it simplifies probability calculations involving normally distributed random variables. Because the normal distribution is a continuous distribution probabilities can be computed as areas under the density curve. But the probability density function does not have a closed form integral solution and those areas must be determined numerically. Further, many introductory courses in statistics do not require calculus as a prerequisite and so integration is not an assumed skill. Tables of probabilities (areas) associated with the standard normal distribution are provided in introductory statistics texts. Finding the probability that a normally distributed random variable X with mean µ and variance σ² falls in some interval (a, b) is solved by converting to standard units and using the tabled values. Using the standard normal distribution to solve probability problems is no longer of much practical importance because probabilities can now be determined using computer software but the standard normal random variable still plays a major role in statistical inference as we will see.
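In software, the table lookup is usually replaced by the error function, since the standard normal CDF can be written Φ(z) = (1/2)(1 + erf(z/√2)). The sketch below is ours and purely illustrative; it checks the 68-95-99.7 rule and the height example.

import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1.0, 1.96, 3.0):                 # the 68-95-99.7 rule
    print(k, round(phi(k) - phi(-k), 4))   # 0.6827, 0.95, 0.9973

z = (61 - 65.5) / 2.5                      # the height example: Z = -1.8
print(round(phi(z), 4))                    # P(Z < -1.8), about 0.0359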
in part, the importance of the normal distribution in statistics. Users of statistics are often interested in linear transformations of their data or in combining data through the use of linear combinations. For example, the sample mean is a linear combination of observed data. The application of probability theory to data analysis must take this into account. Linear functions and linear combinations of normally distributed random variables have a property that is not shared by other probability distributions that might serve as a model for a data generating process. Suppose X is a normally distributed random variable with mean µ and variance σ². Then any linear transformation of the form Y = a + bX (with b ≠ 0) will be normally distributed with mean a + bµ and variance b²σ². Suppose we have a sequence of independent random variables X₁, X₂, · · · , Xₙ with means µ₁, µ₂, · · · , µₙ and variances σ₁², σ₂², · · · , σₙ². Let a₁, a₂, · · · , aₙ be constants. What is the probability distribution of the linear combination Y = a₁X₁ + a₂X₂ + · · · + aₙXₙ? It can be shown that if each of the Xᵢs is normally distributed then the linear combination Y is also normally distributed with mean a₁µ₁ + a₂µ₂ + · · · + aₙµₙ and variance a₁²σ₁² + a₂²σ₂² + · · · + aₙ²σₙ². The important point of the above results is not the resulting means and variances. Those results hold for any probability distribution. What is important and, mathematically speaking, remarkable is that linear transformations and linear combinations of normally distributed random variables are themselves normally distributed. Many commonly used statistical methods start with an assumption that observed data are a representative sample drawn from a population of individuals. A frequent goal is to use summary information in the sample to draw inferences about unknown corresponding quantities in the population. For example, the sample mean of a data set is commonly used to estimate an unknown population mean. Quantifying the uncertainty associated with the estimate requires a probability model. The data are viewed as realizations of a sequence of independent random variables X₁, X₂, · · · , Xₙ. The sample mean, viewed as a random variable, is

X̄ = (1/n)(X₁ + X₂ + · · · + Xₙ).

Given the additional assumption that the values in the population can be approximated by a normal distribution with mean µ and variance σ², X̄ will be normally distributed with mean µ and variance σ²/n. We discuss the implications of this result in more detail below. The normal distribution is important in statistics for another reason, a truly remarkable and fascinating result: the Central Limit Theorem (CLT). There are different versions of the CLT but we will consider it as it pertains to the probability distribution of the particular linear combination of random variables called
the sample mean. Suppose we have a sequence of independent random variables X₁, X₂, · · · , Xₙ all sharing the same finite mean µ and finite variance σ². In the context of statistics we think of these random variables as constituting a random sample from a population with mean µ and variance σ². We make no other distributional assumptions about the random variables, i.e. about the distribution of values in the population. The sample mean is

X̄ = (1/n)(X₁ + X₂ + · · · + Xₙ).

The Central Limit Theorem says that if the sample size n is large enough then X̄ will be approximately normally distributed as N(µ, σ²/n). How large is “large enough”? A value frequently seen in introductory statistics texts is that n ≥ 30. But, like all rules of thumb, this one should not be applied indiscriminately. For some well behaved distributions (i.e. symmetric with small variances) sample sizes of 5 to 10 may suffice. For other distributions, especially those with high variance (known as fat or heavy-tailed distributions), the required sample size can be significantly greater than 30.
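A small simulation can make the theorem vivid. The sketch below is ours; the exponential population (mean 1, variance 1) and the sample sizes are arbitrary illustrative choices.

import random, statistics

random.seed(1)

def sample_mean(n):
    # mean of n draws from a skewed population (exponential, mean 1, variance 1)
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(20_000)]
    # the sampling distribution should center near 1 with variance near 1/n
    print(n, round(statistics.fmean(means), 3), round(statistics.variance(means), 4))

A histogram of the means would still look skewed for n = 2 but close to bell-shaped by n = 30, which is exactly the rule of thumb discussed above.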
5.3 The binomial distribution
Consider an experiment that has only two possible outcomes, which can be labeled a “success” or a “failure”. The probability of a success is denoted by p and the probability of a failure by 1 − p. The simplest such experiment is a coin toss with heads counting as a “success”. If the coin is fair then p = 0.5. An experiment of this type is called a Bernoulli trial. We will often be interested in counting the number of successes in a sequence of n such trials. The number of successes is a random variable with a probability distribution. We make the following two assumptions:

1. The n trials are independent, i.e. the outcome of any given trial does not affect the outcomes of other trials.
2. The probability of success p is the same on all the trials.

Then X = the number of successes is said to have a binomial distribution with parameters n and p, or using a common form of mathematical shorthand we say that X is Bin(n, p). The probability distribution (or probability mass function) is

p(x) = P(X = x) = [n!/(x!(n − x)!)] pˣ(1 − p)ⁿ⁻ˣ

for x = 0, 1, · · · , n. The n! term is called “n factorial” and is defined as n! = n(n − 1)(n − 2) · · · (3)(2)(1) for any positive integer n. We define 0! = 1. This distribution is derived using the rules of probability. Any given sequence of x successes and n − x failures in
n trials has probability pˣ(1 − p)ⁿ⁻ˣ from the assumption of independence and constant probability of success. There are

n!/(x!(n − x)!)

mutually exclusive sequences with x successes and n − x failures. Thus to compute the probability of x successes we add up the n!/(x!(n − x)!) probabilities pˣ(1 − p)ⁿ⁻ˣ. We have used the multiplication rule for mutually independent events and the addition rule for mutually exclusive events. The mean and variance of a binomially distributed random variable X are µ_X = np and σ_X² = np(1 − p), respectively. The binomial distribution is a common probability model chosen for data analysis problems involving proportions. Examples include estimating the proportion of voters who favor a particular candidate, the proportion of cancer patients cured by a new treatment, the proportion of individuals in a population who have a particular disease, and so on. The proportion of successes in n Bernoulli trials is p̂ = X/n. The sample proportion is a linear transformation of X and is also a random variable. It has mean µ_p̂ = p and variance σ_p̂² = p(1 − p)/n. Use of p̂ as a statistical estimator requires knowledge of its probability distribution and, unlike normally distributed random variables, linear transformations of binomial random variables are not binomially distributed. However, if the sample size is large enough then the Central Limit Theorem implies that p̂ will be approximately

N(p, p(1 − p)/n).

This result follows from the fact that the sample proportion can be considered the mean of a random sample of Bernoulli random variables. In this case “large enough” is a value of n such that both np and n(1 − p) are greater than 10.
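The sketch below, ours and purely illustrative (n = 100 and p = 0.3 are arbitrary choices), computes an exact binomial probability and its normal approximation side by side.

import math

def binom_pmf(x, n, p):
    # P(X = x) for X ~ Bin(n, p)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 100, 0.3
assert n * p > 10 and n * (1 - p) > 10                 # the "large enough" check

exact = sum(binom_pmf(x, n, p) for x in range(26))     # P(X <= 25), exact
z = (0.25 - p) / math.sqrt(p * (1 - p) / n)            # standardize p_hat = 0.25
approx = 0.5 * (1 + math.erf(z / math.sqrt(2)))        # normal approximation
print(round(exact, 4), round(approx, 4))               # ~0.163 vs ~0.138

The two values agree only roughly here; a continuity correction, which we omit for simplicity, would tighten the approximation.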
5.4 Sampling distributions
We know by now that in statistical inference, we make inferences about a population based on information in a sample which is a subset of the population. Mostly we do not know the details of the population and we will use the information in the sample to estimate the unknown population quantities of interest. Obviously, we require that the sample be representative of the population. We also need a probability model. A population of interest has been identified and interest centers on a numerical characteristic of the population. We call such characteristics parameters and these are constants. A common parameter of interest is the population mean, µ. A sample will be drawn and the sample mean will be used to estimate the population
mean. The distribution of the values associated with the individuals in the population is called the population distribution. Prior to data collection we consider the sample to be a sequence of independent random variables X₁, X₂, · · · , Xₙ. Each of these random variables has the same probability distribution with mean µ (equal to the population mean) and variance σ² (equal to the population variance). The sample mean is of course

X̄ = (1/n)(X₁ + X₂ + · · · + Xₙ).

A common assumption is that the population distribution is normal with mean µ and variance σ². We know from the above that X̄ is normally distributed with mean µ and variance σ²/n. This is the probability distribution of the random variable X̄. In the field of statistics this random variable will be put to a particular use and statisticians have their own terminology to describe this quantity. It is a statistic and its probability distribution is called a sampling distribution. In general the sampling distribution of a statistic is a description of how probability is assigned to the possible values the statistic can take on. What does the sampling distribution of X̄ tell us? We see that the expected value or mean of this statistic is equal to the parameter it will be used to estimate. We say that X̄ is an unbiased estimator of µ. We see that the variability in the values the sample mean can assume is a function of the population variance and the sample size. In particular, as the sample size gets larger the variance of X̄ decreases. This is a desirable characteristic; the larger the sample, i.e. the more information we have about the population, the more we can trust that X̄ will produce a good estimate of µ. The sample mean then has two properties we would like to see in any statistic; we want statistics to be unbiased (at least approximately) and we want the variability in the possible values the statistic can assume to decrease with increasing information (larger sample size). The rules of probability mean that an assumption of a normal population distribution results in a normal sampling distribution for the sample mean. For example, we know that prior to data collection there is a 95% probability that the calculated value of the sample mean will lie within 1.96σ/√n units of the population mean. If the assumption of normality for the population is not valid then the sampling distribution of the sample mean will no longer be normal, although the sample mean still has expected value (mean) µ and variance σ²/n. Thus, it is still unbiased and it still has a variance that decreases as the sample size increases. Further, if the sample size is large enough, the sampling distribution of the sample mean will still be approximately normal by the Central Limit Theorem and prior to data collection there is an approximate 95% probability that the calculated value of the sample mean will lie within 1.96σ/√n units of the population mean. A related and common application is estimation of a population proportion. The population of interest can be visualized as a population of two values, 0 or 1, with 1 denoting a “success”. A simple random sample of size n will be taken and X = the number of successes will be counted.
The sample proportion p̂ = X/n will be computed and used to estimate the population proportion p. Given these assumptions X will have a binomial distribution with mean np and variance np(1 − p), and p̂ will have mean p and variance p(1 − p)/n. The statistic p̂ is an unbiased estimator of the parameter p and its variance decreases with increasing sample size. Its sampling distribution is known but is awkward to work with. From the Central Limit Theorem we know that if n is large enough then the sampling distribution of p̂ will be approximately normal with mean p and variance p(1 − p)/n.
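A short simulation, ours and purely illustrative (p = 0.4 and n = 50 are arbitrary choices), illustrates these sampling-distribution facts for p̂.

import random, statistics

random.seed(2)
p, n = 0.4, 50                                    # assumed population proportion, sample size
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(10_000)]
print(round(statistics.fmean(p_hats), 4))         # close to p = 0.4 (unbiased)
print(round(statistics.variance(p_hats), 5))      # close to p*(1-p)/n
print(round(p * (1 - p) / n, 5))                  # 0.0048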
6 STATISTICAL INFERENCE
Statistical inference involves drawing conclusions about an entire population based on the information in a sample. For the sake of discussion, we divide this section into three subsections. First, we broach the two ways of doing inference, (i) estimation, and (ii) the test of significance. Second, we distinguish between two ways of conducting estimation, (i) point estimation and (ii) confidence interval estimation. Third and finally, we introduce the concept of the test of significance.
6.1 Two ways of doing inference: estimation and testing
Consider first the estimation problem and the test of significance. In the problem of estimation, we determine the values of the parameters, and in the test of significance, we determine whether the result we observe is due to random variation of the sample or due to some other factor. Problems of estimation can be found almost everywhere, in science and in business. Consider an example in physics where Newcomb measured the speed of light. Between July and September 1882, he made 66 measurements of the speed of light. The speed of light is a constant but Newcomb’s measurements varied. Newcomb was interested in determining the “true” value of the parameter; the speed of light. In business, a retailer might like to know something about the average income of families living within 3 miles of her store. Note the difference in the two populations in the examples. The retailer’s population of interest is finite and well-defined. Theoretically she could sample every household of interest and compute the true value of the parameter. Logistically, such a sampling plan, called a census, is often practically impossible and a sample is taken from that population. Newcomb’s population is hypothetical. It is comprised of all possible measurements he could have theoretically made and estimation of the parameter of interest requires an assumption that the 66 observed measurements are a random sample from that population. Scientists and other users of statistics often have a different question of interest. Imagine Newcomb in an argument with another scientist. Newcomb believes the
speed of light is equal to one value and the other scientist believes the value is less than that. The goal of the study would have been to assess the strength of evidence in the data for or against these hypotheses. The retailer could be considering an expansion of her store but will not do it unless the mean income of families in the nearby area exceeds a specified threshold. She takes a sample, not so much with the intent of estimating the income but of determining if there is sufficient evidence that it is high enough to justify expansion. Statisticians have argued and continue to argue over the best ways to do these two types of inference. Testing in particular is controversial. Below we will continue with the approach we have taken above and present the basics as they are often presented in introductory courses in statistics. Our goal is to present the concepts, not practically useful methods.
6.2 Two types of estimation problems: point and interval estimation
An investigator is interested in a population, in particular, in the value of some constant numerical quantity θ, called a parameter. The value of θ is unknown. A representative sample of size n will be taken (denoted X₁, · · · , Xₙ) and information in the sample will be used to estimate the parameter. This is done by distilling the information in the n values in the data set into a single summary quantity called a point estimator or statistic, θ̂. Under an assumption of random sampling the estimator (statistic) is a random variable that is a function of the data, θ̂ = θ̂(X₁, · · · , Xₙ). For convenience the functional notation is generally suppressed but it is important to understand that point estimators are random variables that are functions of data. In practice, a single sample is taken and a single value of the point estimator is obtained. That value, a realization of the estimator, is called an estimate. In practice, we only have a single sample and a single estimate but we know that if we took another sample from the same population we would not get the same value. Recall the above thought experiment of taking many samples and computing a realization of the estimator from each. The distribution of those values is the sampling distribution of the estimator. Of course, we cannot take very many samples to determine the sampling distribution and this is why probability theory is so important — under some assumptions the sampling distribution of an estimator can be derived using the results of probability. Simon Newcomb took 66 measurements of the time it took light to travel a known distance. The values are shown below. The values are the number of nanoseconds each measurement differed from 0.000024800 seconds. Thus, the first measurement was actually 0.000024828 seconds. Although the speed of light was believed to be constant (at least according to theory) there is variability in the measurements. Denote the true time it takes light to travel the distance in Newcomb's experiment by µ.
Table 1. Newcomb’s 28 26 33 24 34 -44 27 16 40 -2 29
measurements 22 36 26 24 32 30 21 36 32 25 28 36 30 25 26 23 21 30 29 28 22 31 29 36 19 37 23 24 25 27 20 28 27
of the speed of light data 28 28 27 24 31 25 27 32 26 25 33 29 26 27 32 28 32 29 24 16 39 23
Using the information in the sample requires several assumptions. We assume that µ is the mean of a hypothetical population of all possible measurements. We assume the 66 observed measurements are representative of that population. The measurements are independent of one another in the sense that knowing the value of the ith measurement provides no information about the (i+1)st. Stated another way, we could permute the values of the observations in the table without impacting the analysis. We also assume that the population of possible measurements is normally distributed. This latter assumption is questionable in light of the two outliers −44 and −2. Such atypical values are not unusual in data sets and the question of what to do with them is not easily answered. Here, for simplicity, we omit them and proceed with an analysis of the remaining 64 observations. The sampling distribution of the sample mean X̄ is normal with mean µ and standard deviation σ/√64. The point estimate, i.e. the realized value of the sample mean, is x̄ = 27.75. In a sense this is our best guess, given the data, about the value of µ. However, there is uncertainty about this estimate because we know the value of the sample mean varies from sample to sample. That is, if Newcomb had taken another 64 observations it is highly unlikely that he would have observed the same value of the sample mean. The variability is quantified by the standard deviation, σ/√64, which in practice is itself an unknown parameter and must be estimated by the sample standard deviation, which is 5.08 for this data set. We will assume, unrealistically, that this, in fact, is the population standard deviation and that σ/√64 = 5.08/8 = 0.635. We have a point estimate and we have quantified the uncertainty associated with this estimate. We go further however and compute an interval estimate of µ, called a confidence interval. Prior to data collection we know, from the properties
of the normal distribution, that

Z = (X̄ − µ)/(σ/√n)

follows a standard normal distribution and that

P(−1.96 < Z < 1.96) = P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = 0.95.

A relatively simple set of algebraic manipulations leads to the conclusion that

P(X̄ − 1.96σ/√n ≤ µ ≤ X̄ + 1.96σ/√n) = 0.95.

Thus, prior to data collection we know that the random interval (X̄ − 1.96σ/√n, X̄ + 1.96σ/√n) will capture or contain µ with probability 0.95. Over the long run, 95% of all such intervals will contain the population mean but 5% will not. When we plug in the values from Newcomb's data we have the realized value of the interval,

x̄ ± 1.96σ/√n = 27.75 ± 1.96(5.08)/8 = (26.5, 29.0).

The probability that µ lies between 26.5 and 29.0 is either 0 or 1 because µ is a fixed constant. If µ = 23 then the probability it lies between 26.5 and 29.0 is 0. If µ = 28 then the probability it lies between 26.5 and 29.0 is 1. However, the process that produced the interval is reliable in that over the long run 95% of all intervals produced in this way will contain µ, so we say that we are 95% confident that µ lies in the interval (26.5, 29.0). It is possible to compute intervals for other levels of confidence. Let z_{α/2} be the value from a standard normal distribution with α/2 of the area under the normal curve lying above it, i.e. P(Z > z_{α/2}) = α/2. Then a 100(1 − α)% confidence interval for the mean µ of a normally distributed population when the value of σ is known is given by

x̄ ± z_{α/2} σ/√n.
The most common confidence levels used in practice are 90% (z_{0.05} = 1.645), 95% (z_{0.025} = 1.96), and 99% (z_{0.005} = 2.575). Three key assumptions are required for such intervals to be valid:
(i) The data must represent a simple random sample from the population of interest. (ii) The population must be normally distributed.
(iii) The standard deviation of the population is known.

By far the most important assumption is (i). One implication of the assumption is that observations can be considered independent of one another. This assumption, along with assumption (ii), is required for the sampling distribution of X̄ to be normal with the indicated mean and variance. It is the normality of X̄ that was necessary for the above derivation of the confidence interval procedure. As long as the sample size is large enough the procedure is fairly robust to violations of assumptions (ii) and (iii). If the population is not normal but the sample size is large enough then the Central Limit Theorem will result in an approximately normally distributed X̄ and the confidence levels of intervals will be approximately correct. If the population standard deviation is not known then we estimate it with the sample standard deviation. Given a large enough sample size this estimate should be reasonably close to the true standard deviation, close enough that the intervals will have a true confidence level close to the advertised level. One might wonder why 100% confidence intervals are not computed. The answer is that they are so wide as to be practically useless (a demographer can be 100% confident that the population of the earth is between 0 and one trillion people). Construction of confidence intervals is a balancing act between achieving a high enough level of confidence and a narrow enough interval to be useful. The half-width of a confidence interval is z_{α/2}σ/√n. Clearly the larger the population standard deviation the wider an interval will be. Investigators generally have little control over the population standard deviation but do have control over the level of confidence and the sample size. However, specification of a higher level of confidence requires use of a larger z_{α/2}, leading to a wider interval. Larger sample sizes will lead to narrower intervals. However, collecting samples can be expensive and there will often be limits on how large a sample can be selected. Also, because the half-width of the interval is inversely proportional to the square root of the sample size, increasing the sample size by a factor of, say, k does not decrease the width by a factor of k. For example, increasing the sample size by 4 times only cuts the interval width in half. It is possible in the simple setting described here to manipulate the confidence interval formula to determine a sample size needed to achieve a specified width and level of confidence. However, this is more difficult in general. Confidence intervals are widely used but remain controversial among many statisticians. The interpretation of confidence intervals in particular is difficult for many students and even more sophisticated users to grasp. It is easy to say that one is 95% confident that an interval contains a parameter of interest but when pressed as to exactly what that means many cannot explain it satisfactorily. They want to say, indeed they almost surely believe, that the interval contains the parameter with 95% probability. Some statisticians empathize with such users. They object to the invention of another way of quantifying uncertainty, especially one difficult to understand. To these statisticians the natural language of uncertainty is probability, but they tend to interpret probability differently from frequentists. The approach taken by these statisticians is typically referred to
as Bayesian because of the prominent role played by Bayes' Rule in their methodology. This is perhaps an unfortunate description, because Bayes' Rule is a theorem and consequently holds in the relevant cases regardless of one's interpretation of probability, but the name has been used too long to change now. We do not discuss this approach to interval estimation here as our goal is to present a primer of statistics as it is taught in most introductory courses. However, some of the chapters in this volume will discuss the Bayesian approach in more detail.
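Before turning to testing, the interval computed above can be reproduced in a few lines. The sketch is ours and purely illustrative; as in the text, the two outliers are dropped and σ = 5.08 is (unrealistically) treated as known.

import math, statistics

newcomb = [28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29,
           22, 36, 26, 24, 32, 30, 21, 36, 32, 25, 28,
           36, 30, 25, 26, 23, 21, 30, 29, 28, 22, 31,
           29, 36, 19, 37, 23, 24, 25, 27, 20, 28, 27,
           28, 28, 27, 24, 31, 25, 27, 32, 26, 25, 33,
           29, 26, 27, 32, 28, 32, 29, 24, 16, 39, 23]
data = [x for x in newcomb if x not in (-44, -2)]     # drop the outliers; 64 values remain

xbar = statistics.fmean(data)                         # 27.75
half = 1.96 * 5.08 / math.sqrt(len(data))             # half-width, sigma treated as known
print(round(xbar, 2), (round(xbar - half, 1), round(xbar + half, 1)))  # 27.75 (26.5, 29.0)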
6.3 The test of significance and hypothesis testing
The question of interest in point and interval estimation is, “What is the value of an unknown parameter and how do we quantify our uncertainty about our estimate of the parameter?” Testing is motivated by another question: “How consistent are the data with a stated claim or hypothesis about the value of a parameter?” The hypothesis takes the form of specifying a particular value or of specifying a range of values within which the true value lies. Implicit in the specification of a hypothesis about the value of a parameter is another hypothesis (the negation of the first) that the value of the parameter is something else. The two hypotheses are called the null hypothesis (H₀) and the alternative hypothesis (Hₐ). The null hypothesis specifies a value or range of values of a parameter and the alternative is the value or range of values deemed plausible if the null hypothesis is not true. Confusingly, there are two basic approaches to testing encountered in the introductory statistical literature: the test of significance and the hypothesis test. The first approach is attributed to the English statistician and population geneticist Sir Ronald A. Fisher. Hypothesis testing is attributed to the Polish mathematician Jerzy Neyman and the English statistician Egon Pearson. Different texts treat the topic in different ways. Devore [2008] discusses Neyman-Pearson hypothesis testing. DeVeaux, Velleman, and Bock [2008] place more emphasis on significance testing although they briefly describe the Neyman-Pearson approach. In general, the discussion of testing in such textbooks takes a middle ground between the two approaches that neither Fisher nor Neyman-Pearson would have approved of. We will start with significance testing and have a little to say about hypothesis testing at the end. The following simple example illustrates the basic reasoning in a test of significance. An hypothesis is stated along with an alternative. Data are collected and steps are taken to determine if the data are consistent with the null hypothesis. In this example, it is possible that the data are consistent with the null hypothesis and the decision is to fail to reject the null. If the data are not consistent with the null hypothesis then it might be decided to reject the null. In practice, the data may be equivocal and no decision is warranted. Or it may be that different investigators make a different decision given the same data. We now describe the procedure for tests of significance about population means. We will assume that we have a simple random sample of size n from a normally distributed population with unknown mean µ and variance σ² (or standard de-
viation σ). For definiteness, suppose that another physicist working at the same time as Newcomb believed that the speed of light was such that the mean time it took light to travel the distance in Newcomb's experiment was 29 (using the scaled units from above). We suppose (in our fictional account) that Newcomb did not believe this physicist was correct. Being a scientist he was willing to put his supposition to the test. Further, he was willing to start by assuming that he was wrong and then use the data to show that this assumption is not tenable. This is a classic argument by contradiction. Newcomb did not have the tools to carry out the test that we have but we will see what would have happened if he had. Denoting the true time it takes light to travel the distance in the experiment by µ, the null and alternative hypotheses are

H₀: µ = 29
Hₐ: µ ≠ 29.

We need a test statistic; a means of quantifying the evidence against the null hypothesis. It is natural to base the test statistic on the sample mean X̄. Note that both large and small values of the sample mean would provide evidence against the null and for the alternative. We will use the 64 measurements given above. Recall that we assumed that the standard deviation was known, σ = 5.08, and that we computed the realized value of the sample mean to be x̄ = 27.75. Is this consistent with the null hypothesis, i.e. could a value of 27.75 plausibly be observed if the true mean is indeed 29? We know that the sampling distribution of the sample mean X̄ is normal with mean µ and standard deviation σ/√n = 5.08/8 = 0.635. If the null hypothesis is true then the mean is 29. Thus, probability theory gives us the sampling distribution of the sample mean under an assumption that the null hypothesis is true. We can use this information to see how unusual the observed value of the sample mean is given an assumed population mean of 29. The test statistic commonly used is

Z = (X̄ − 29)/0.635.

Under the assumptions Z will have a standard normal distribution. We compute the value of the test statistic

z = (27.75 − 29)/0.635 = −1.97.
If the null hypothesis is true then the observed value of the sample mean lies 1.97 standard deviations below the hypothesized mean of 29. In our example, Newcomb did not seem to have a direction in mind initially, i.e. he simply stated his belief that the true mean was not equal to 29. This is an example of a two-sided (or two-tailed) test because as noted above both large and small values of the sample mean would be unusual under the null hypothesis. Of course, any single observation is probabilistically uncertain (with our underlying continuous probability model any single observation has probability 0). We quantify the evidence against the null by computing the probability of observing a value of the test statistic as extreme
or more extreme than the one we actually observed COMPUTED UNDER AN ASSUMPTION THAT THE NULL HYPOTHESIS IS TRUE. This probability is referred to as a P-value. Given the two-sided nature of the test we compute

P(Z ≤ −1.97) + P(Z ≥ 1.97) = 0.049.

This calculation was carried out assuming that Z is a standard normal random variable, which is true if the null hypothesis is true. Thus, if the null hypothesis is true we would expect to see a value of the sample mean of 27.75 or something more “extreme” (less than 27.75 or greater than 30.25) only about 4.9% of the time if the experiment were repeated over and over again, independently, and under the same conditions. Is this evidence strong enough to reject the null hypothesis? A frequently used cut-off value between “strong enough” and “not strong enough” is 0.05. If the P-value is less than or equal to 0.05 then reject the null hypothesis and if the P-value is greater than 0.05 then fail to reject the null hypothesis. If one adheres strictly to this criterion then the evidence would be strong enough to reject. The selection of a cut-point between reject and fail-to-reject is called fixed-level testing. The cut-point itself is referred to as an α-level. Other commonly chosen levels are α = 0.01 and α = 0.10. These three values (0.01, 0.05, and 0.10) have become so ingrained in statistical practice that many people assume there is some scientific justification for them but that is not true. Two different people could come to different conclusions about the results of a test based on their particular selection of a significance level. If Newcomb chose α = 0.05 he would reject the null hypothesis but his imaginary opponent could have demanded stronger evidence, say by choosing α = 0.01, in which case he would fail to reject the null hypothesis. We have seen a specific example of a two-sided test. More generally a two-sided test consists of the following steps.

1. State the null and alternative hypotheses:

H₀: µ = µ₀
Hₐ: µ ≠ µ₀.

2. Compute the value of the test statistic:

z = (x̄ − µ₀)/(σ/√n).

3. Compute the P-value:

P-value = 2P(Z ≥ |z_obs|),
where Z is assumed to be standard normal and z_obs is the observed value of the test statistic. There are two other, one-sided versions of significance tests. The upper-tailed version has null and alternative hypotheses of

H₀: µ ≤ µ₀
Hₐ: µ > µ₀

and the lower-tailed version has null and alternative hypotheses of

H₀: µ ≥ µ₀
Hₐ: µ < µ₀.

The test statistic is the same in all cases. The P-value for the upper-tailed version is the probability of getting a value of the test statistic equal to or greater than the one actually observed, computed under an assumption that the true mean is µ₀. If the evidence against the null is strong enough to reject the null when the true mean is µ₀ it will be even stronger when µ < µ₀. Similarly, the P-value for the lower-tailed version is the probability of getting a value of the test statistic equal to or less than the one actually observed, computed under an assumption that the true mean is µ₀. If fixed-level testing is desired then an appropriate significance level is chosen and the decision of whether to reject or fail to reject the null hypothesis is based on a comparison of the P-value to the significance level. Hypothesis testing is carried out in a manner that is mechanically identical to the above. A key difference is that interest centers on controlling the rates at which various errors are made. Hard and fast rules are specified for whether or not to reject the null hypothesis. Rejecting the null hypothesis when µ = µ₀ is called a Type I error, and the significance level is the rate at which that error will occur. Failing to reject the null hypothesis when it is false is called a Type II error. The power of a test is 1 − P(Type II error). In practice an investigator decides how much of a difference between the true mean and the hypothesized value µ₀ he is willing to accept, i.e. he decides how false the null has to be before he needs a high probability of detecting it. Obviously the ideal hypothesis test will have low rates of both types of errors but specifying a small chance of a Type I error leads to a larger chance of a Type II error. This tension between the two types of errors is frequently misunderstood. An investigator will choose a low level of significance (a small chance of committing a Type I error), say 1%, failing to understand that the probability of a Type II error can be quite high. Hypothesis testing is frequently characterized as a decision making process. As such a null hypothesis may be “accepted”, not because it is believed to be true but because it is believed to be true enough to justify acting as if it is true. This may make sense in applications in quality control, for example. A shipment of raw material arrives at a plant and it must be accepted or rejected. Type I
and Type II error rates are determined based on economic considerations and on purity considerations. It is known that over the long run a certain proportion of acceptable shipments will be rejected and a certain proportion of unacceptable shipments will be accepted, but those error rates are economically justifiable. As mentioned above, Fisher is generally credited with inventing significance testing and Neyman and Pearson are credited with hypothesis testing. Fisher, who was a scientist, was critical of the Neyman-Pearson approach. He did not concern himself with Type II errors. He did not see science as a decision making process. Neyman and Pearson did not use P-values in their original work. They were interested in controlling error rates. The test statistics in tests of population means were sample means themselves. They used the specified values of error rates to find so-called rejection and acceptance regions for values of the sample mean. If an observed value fell in a rejection region the null was rejected. If a value fell in an acceptance region the null was “accepted”. Later, it was noticed that the P-value could be used to determine if the sample mean was in or out of a rejection region. If the P-value was less than the significance level then the value of the sample mean fell in the rejection region and vice versa. But the size of the P-value meant little in Neyman-Pearson hypothesis testing. If the significance level was chosen to be 0.05 then a P-value of 0.049 meant the same as a P-value of 0.00000006, and a P-value of 0.0495 would lead to a decision to reject the null while one of 0.0505 would lead to a decision to fail to reject the null. Fisher believed the observed size of the P-value was important. We will not go into the details of the distinction between the two methods of testing. Royall [1997] has a good discussion of the differences. We will also not concern ourselves with the controversies that still surround this topic. Some of those controversies are addressed in chapters to follow. Our goal here has been merely to provide an overview of the main results from probability theory and mathematical statistics to readers who may not have seen this material before or to those who saw it a long time ago. Also, we have presented the basic concepts of estimation and testing hypotheses about population means under the unreasonable assumption that the population standard deviation is known. There are many more parameters of interest (variances and standard deviations, correlation coefficients, differences in means, slopes and intercepts of regression lines, etc.), and different assumptions are needed in many other cases (the population standard deviation is rarely if ever known in practice). The details differ but the broad concepts presented here remain the same and those are the concepts we wanted to address.
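The entire significance-test calculation for the Newcomb example can likewise be collected in a few lines. The sketch is ours and mirrors the three steps above, with σ again treated as known.

import math

xbar, mu0, sigma, n = 27.75, 29.0, 5.08, 64
z = (xbar - mu0) / (sigma / math.sqrt(n))             # step 2: z = -1.97
phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
p_value = 2 * phi(-abs(z))                            # step 3: two-sided P-value
print(round(z, 2), round(p_value, 3))                 # -1.97 0.049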
7 CONCLUSION
It is time to wrap up our primer on probability and statistics. Before we conclude we revisit where we began our journey. We defined a notion called “monotonic reasoning” that underlies the deductive consequence relation which, in fact, captures the core of deductive reasoning. We have seen that a deductive consequence relation is
by definition monotonic. The way we have understood monotonic reasoning is this: a relation between a set of sentences and a sentence is monotonic if and only if, whenever it holds between a set and a sentence, it also holds between any superset of the set and the sentence. We mentioned that probability theory provides tools for both understanding and handling inductive arguments. However, there is a possibility of misunderstanding: that whenever the expression “probability” occurs in the statistical or philosophical literature, it necessarily means that the context is inductive, and therefore cannot be monotonic. We will show that there are contexts in which, although the expression “probability” occurs, they do not have anything to do with inductive inferences. To drive this point home, we make a distinction between “probabilistic inference” and “probabilistic conclusion”. By “probabilistic inference,” we mean an inference engine that takes statements as inputs and spits out as its output a conclusion, where the inference itself is probabilistic. All inductive inferences are probabilistic inferences in this sense. Consequently, they are non-monotonic. By “probabilistic conclusion” we mean an argument whose conclusion contains a probabilistic statement. An argument with a probabilistic conclusion need not be inductive. Hence, it need not involve non-monotonic reasoning. In one kind of case, an inference involving probability has a categorical conclusion that does not follow necessarily from the premises of the argument. Since the uncertainty here concerns the inference itself, the inference is sometimes called “uncertain inference.” In other cases, the conclusion of an inference could be a probabilistic statement that follows deductively from its premises. This is a case of inference about uncertainty (see [Hempel, 1965; Kyburg, 2000; Bandyopadhyay and Bennett, 2004]). We get four possible combinations of probabilistic conclusion and probabilistic inference. They are as follows:

Probabilistic conclusion, but not probabilistic inference.
Probabilistic inference, but no probabilistic conclusion.
Probabilistic inference and probabilistic conclusion.
Neither probabilistic conclusion nor probabilistic inference.

(i) Probabilistic conclusion, but not a case of probabilistic inference.

P1: This die has six faces, labeled 1, 2, 3, 4, 5, 6.
P2: Each face is equally probable.
C: The probability of rolling a 3 on this die is 1/6.

The conclusion makes a statement about the probability of a certain event and the conclusion follows deductively from the premises. Given the probability calculus, it is a deductively valid argument in which if the premises are true, then the conclusion must be true, too. If we add a new premise to this argument then we won't be able to undermine its deductive validity.
Therefore, it involves monotonic reasoning. At this point, we want to mention that Simpson's paradox, a well-known paradox in probability and statistics, in fact falls under this category. Simpson's Paradox involves the reversal of the direction of a comparison or the cessation of an association when data from several groups are combined to form a single whole (see the “Introduction” together with Hájek's paper in the volume for this position). This shows that an argument with a probabilistic conclusion does not necessarily involve a probabilistic inference; therefore, the argument in question does not represent an inductive inference.

(ii) Probabilistic inference, but no probabilistic conclusion.

P1: 95% of the balls in the urn are red.
P2: A ball is randomly selected from the urn.
C: The ball drawn will be red.

The conclusion does not follow deductively from its premises; it is a categorical statement about a particular ball, and it may be false. Given a long-run relative frequency interpretation of probability we expect to see a red ball about 95% of the time with repeated draws. We might be willing to bet a large sum of money that a particular draw will be red, but we cannot state this unequivocally. It is fair to say that even though the premises are true, the conclusion could very well be false. Hence, the inference is not deductive. As a result, the reasoning involved is non-monotonic and therefore this is an example of inductive inference.

(iii) Probabilistic inference and probabilistic conclusion.

The following is an example in which the inference is probabilistic as well as its conclusion.

P1: The die has six faces, labeled 1, 2, 3, 4, 5, 6.
P2: In a sequence of 250 rolls, a 3 was rolled 50 times.
C: The probability of rolling a 3 with this die is about 1/5.

In this situation, we happen to observe 50 threes in 250 rolls of the die and infer that the probability of obtaining a three on a single roll is about 1/5. It is an instance of inductive inference since even though the premises are true the conclusion could be false. Therefore, the reasoning the argument employs is non-monotonic. To understand the reasoning involved in (iii), compare (iii) with (i). In (iii) we have a specific outcome of a random process and we are reasoning from the sample to the population. In this sense, we are going from the specific to the general. In (i), however, we are reasoning from the population process to the behavior of samples over the long run. Here, we are making an inference about a specific outcome from a general claim.
Therefore, (i) is a case of deductive inference involving monotonic reasoning, whereas (iii) is a case of inductive inference involving non-monotonic reasoning.

(iv) Neither probabilistic conclusion nor probabilistic inference.

Consider an example which involves neither probabilistic inference nor a probabilistic conclusion.

P1: Wherever there is smoke, there is fire.
P2: The hill has smoke.
C: There is fire on the hill.

Neither its premises nor its conclusion contain a probabilistic claim. This example is a case of non-probabilistic inference (a deductive argument) with a non-probabilistic (categorical) conclusion. Therefore, from the perspective of our probabilistic curiosity, it is uninteresting. Although this specific argument is deductively valid, and therefore involves monotonic reasoning, examples under this rubric need not be deductively valid. Since our present interest lies in probabilistic reasoning, arguments of this kind are of no concern to us.

This ends our primer, in which we have discussed the basic ideas and theories behind probability and statistics. We have also discussed some of the fundamental differences between inductive and deductive arguments. We hope that this will provide a background for the general reader to read and appreciate many of the papers in the volume.

ACKNOWLEDGEMENTS

We would like to thank Abhijit Dasgupta, Malcolm Forster, John G. Bennett, and Monami Chakrabarti for their comments and suggestions regarding our chapter, which have helped to improve it considerably. PSB's research has been supported by Montana State University's NASA Astrobiology research center grant (#4w1781).

BIBLIOGRAPHY

[Bandyopadhyay and Bennett, 2004] P. Bandyopadhyay and J. G. Bennett. Commentary in The Nature of Evidence, M. Taper and S. Lele, eds. University of Chicago Press, 2004.
[DeVeaux et al., 2008] R. DeVeaux, P. Vellman, and D. Bock. Stats: Data and Models, Second Edition. Pearson-Addison Wesley, New York, NY, USA, 2008.
[Devore, 2008] J. Devore. Probability and Statistics for Engineering and the Sciences, Seventh Edition. Brooks-Cole, Belmont, CA, USA, 2008.
[Hájek, 2007] A. Hájek. Interpretations of Probability. In Stanford Encyclopedia of Philosophy, 2007.
[Hempel, 1965] C. G. Hempel. Aspects of Scientific Explanation. New York: Free Press, 1965.
[Kyburg, 2000] H. Kyburg. Probable Inference and Probable Conclusion. In Science, Explanation, and Rationality, J. Fetzer (ed.). Oxford University Press, New York, 2000.
[Moore and McCabe, 2006] D. Moore and G. McCabe. An Introduction to the Practice of Statistics, Fifth Edition. W.H. Freeman & Company, New York, NY, USA, 2006.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London, 1997.
[Stigler, 1986] S. Stigler. The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, MA, 1986.
Part II
Philosophical Controversies about Conditional Probability
CONDITIONAL PROBABILITY

Alan Hájek

1 INTRODUCTION
A fair die is about to be tossed. The probability that it lands with '5' showing up is 1/6; this is an unconditional probability. But the probability that it lands with '5' showing up, given that it lands with an odd number showing up, is 1/3; this is a conditional probability. In general, conditional probability is probability given some body of evidence or information, probability relativised to a specified set of outcomes, where typically this set does not exhaust all possible outcomes.

Yet understood that way, it might seem that all probability is conditional probability — after all, whenever we model a situation probabilistically, we must initially delimit the set of outcomes that we are prepared to countenance. When our model says that the die may land with an outcome from the set {1, 2, 3, 4, 5, 6}, it has already ruled out its landing on an edge, or on a corner, or flying away, or disintegrating, or . . . , so there is a good sense in which it is taking the non-occurrence of such anomalous outcomes as "given". Conditional probabilities, then, are supposed to earn their keep when the evidence or information that is "given" is more specific than what is captured by our initial set of outcomes.

In this article we will explore various approaches to conditional probability, canvassing their associated mathematical and philosophical problems and numerous applications. Having done so, we will be in a better position to assess whether conditional probability can rightfully be regarded as the fundamental notion in probability theory after all. Historically, a number of writers in the pantheon of probability took it to be so. Johnson [1921], Keynes [1921], Carnap [1952], Popper [1959b], Jeffreys [1961], Renyi [1970], and de Finetti [1974/1990] all regarded conditional probabilities as primitive. Indeed, de Finetti [1990, 134] went so far as to say that "every prevision, and, in particular, every evaluation of probability, is conditional; not only on the mentality or psychology of the individual involved, at the time in question, but also, and especially, on the state of information in which he finds himself at that moment". On the other hand, orthodox probability theory, as axiomatized by Kolmogorov [1933], takes unconditional probabilities as primitive and later analyses conditional probabilities in terms of them.

Whatever we make of the primacy, or otherwise, of conditional probability, there is no denying its importance, both in probability theory and in the myriad applications thereof — so much so that the author of an article such as this faces hard choices of prioritisation. My choices are targeted more towards a philosophical audience, although I hope that they will be of wider interest as well.
2 MATHEMATICAL THEORY

2.1 Kolmogorov's axiomatization, and the ratio formula
We begin by reviewing Kolmogorov's approach. Let Ω be a non-empty set. A field (algebra) on Ω is a set F of subsets of Ω that has Ω as a member, and that is closed under complementation (with respect to Ω) and union. Assume for now that F is finite. Let P be a function from F to the real numbers obeying:

1. P(A) ≥ 0 for all A ∈ F. (Non-negativity)

2. P(Ω) = 1. (Normalization)

3. P(A ∪ B) = P(A) + P(B) for all A, B ∈ F such that A ∩ B = ∅. (Finite additivity)

Call P a probability function, and (Ω, F, P) a probability space. One could instead attach probabilities to members of a collection of sentences of a formal language, closed under truth-functional combinations.

Kolmogorov extends his axiomatization to cover infinite probability spaces. Probabilities are now defined on a σ-field (σ-algebra) — a field that is further closed under countable unions — and the third axiom is correspondingly strengthened:

3′. If A₁, A₂, . . . is a countable sequence of (pairwise) disjoint sets, each belonging to F, then P(∪_{n=1}^{∞} Aₙ) = Σ_{n=1}^{∞} P(Aₙ). (Countable additivity)
So far, all probabilities have been unconditional. Kolmogorov then introduces the conditional probability of A given B as the ratio of unconditional probabilities:

(RATIO) P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0.

(On the sentential formulation this becomes: P(A|B) = P(A&B) / P(B), provided P(B) > 0.)
This is often called the "definition" of conditional probability, although I suggest that instead we call it a conceptual analysis¹ of conditional probability. For 'conditional probability' is not simply a technical term that one is free to introduce however one likes. Rather, it begins as a pre-theoretical notion for which we have associated intuitions, and Kolmogorov's ratio formula is answerable to those. So while we are free to stipulate that 'P(A|B)' is merely shorthand for this ratio, we are not free to stipulate that 'the conditional probability of A, given B' should be identified with this ratio. Compare: while we are free to stipulate that 'A ⊃ B' is merely shorthand for a connective with a particular truth table, we are not free to stipulate that 'if A, then B' in English should be identified with this connective. And Kolmogorov's ratio formula apparently answers to most of our intuitions wonderfully well.

¹Or (prompted by Carnap [1950] and Maher [2007]), perhaps it is an explication. I don't want to fuss over the viability of the analytic/synthetic distinction, and the extent to which we should be refining a folk-concept that may not even be entirely coherent. Either way, my point stands that Kolmogorov's formula is not merely definitional.
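Before turning to the arguments in its favour, it may help to see (RATIO) computed over a finite outcome space. The following sketch is ours: events are represented, for illustration only, as sets of outcomes of the fair die.

```python
# A minimal sketch of (RATIO) on a finite space: P(A|B) = P(A ∩ B)/P(B).
from fractions import Fraction

P = {face: Fraction(1, 6) for face in range(1, 7)}  # the fair die

def prob(event):
    """Unconditional probability of an event, i.e. a set of outcomes."""
    return sum(P[w] for w in event)

def cond_prob(A, B):
    """The ratio analysis; undefined (here: an error) when P(B) = 0."""
    if prob(B) == 0:
        raise ValueError("P(B) = 0: (RATIO) is silent")
    return prob(A & B) / prob(B)

print(cond_prob({5}, {1, 3, 5}))  # Fraction(1, 3): '5' given 'odd'
```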
2.2 Support for the ratio formula
Firstly, it is apparently supported on a case-by-case basis. Consider the fair die example. Intuitively, the probability of '5', given 'odd', is 1/3 because we imagine narrowing down the possible outcomes to the three odd ones, observing that '5' is one of them and that probability is shared equally among them. And the ratio formula delivers this verdict:

P(5|odd) = P(5 ∩ odd) / P(odd) = (1/6) / (1/2) = 1/3.

And so it goes with countless other examples.

Secondly, a nice heuristic for Kolmogorov's axiomatization is given by van Fraassen's [1989] "muddy Venn diagram" approach, which suggests an informal argument in favour of the ratio formula. Think of the usual Venn-style representation of sets as regions inside a box (depicting Ω). Think of probability as mud spread over the diagram, so that the amount of mud sitting above a given region corresponds to its probability, with a total amount of 1 unit of mud. When we consider the conditional probability of A, given B, we restrict our attention to the mud that sits above the region representing B, and then ask what proportion of that mud sits above A. But that is simply the amount of mud sitting above A ∩ B, divided by the amount of mud sitting above B.

Thirdly, the ratio formula can be given a frequentist justification. Suppose that we run a long sequence of n trials, on each of which B might occur or not. It is natural to identify the probability of B with the relative frequency of trials on which it occurs:

P(B) = #(B) / n

Now consider among those trials the proportion of those on which A also occurs:

P(A|B) = #(A ∩ B) / #(B)

But this is the same as

(#(A ∩ B) / n) / (#(B) / n)
which on the frequentist interpretation is identified with P(A ∩ B) / P(B).

Fourthly, the ratio formula for subjective conditional probability is supported by an elegant Dutch Book argument originally due to de Finetti [1937] (here simplified). Begin by identifying your subjective probabilities, or credences, with your corresponding betting prices. You assign probability p to X if and only if you regard pS as the value of a bet that pays S if X, and nothing otherwise. Symbolize this bet as:

S if X
0 otherwise

For example, my credence that this coin toss results in heads is 1/2, corresponding to my valuing the following bet at 50 cents:

$1 if heads
0 otherwise
A Dutch Book is a set of bets bought or sold at such prices as to guarantee a net loss. An agent is susceptible to a Dutch Book if there exists such a set of bets, bought or sold at prices that she deems acceptable. Now introduce the notion of a conditional bet on A, given B, which

• pays $1 if A ∩ B
• pays 0 if Aᶜ ∩ B
• is called off if Bᶜ (that is, the price you pay for the bet is refunded if B does not occur).

Identify your P(A|B) with the value you attach to this conditional bet — that is, to:

$1 if A ∩ B
0 if Aᶜ ∩ B
P(A|B) if Bᶜ
Now we can show that if your credences violate (RATIO), you are susceptible to a Dutch Book. For the conditional bet can be regarded as equivalent (giving the same pay-offs in every possible outcome) to the following pair of bets:

$1 if A ∩ B
0 if Aᶜ ∩ B
0 if Bᶜ

and

0 if A ∩ B
0 if Aᶜ ∩ B
P(A|B) if Bᶜ
which you value at P(A ∩ B) and P(A|B)P(Bᶜ) respectively. So to avoid being Dutch Booked, you must value the conditional bet at P(A ∩ B) + P(A|B)P(Bᶜ). That is:

P(A|B) = P(A ∩ B) + P(A|B)P(Bᶜ),

from which the ratio formula follows: since P(Bᶜ) = 1 − P(B), rearranging gives P(A|B)P(B) = P(A ∩ B), and hence P(A|B) = P(A ∩ B)/P(B) whenever P(B) > 0.
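A quick computational check of the payoff equivalence used above may be useful; in this sketch (ours), q is an arbitrary stand-in for your price P(A|B):

```python
# Check: in each possible case, the conditional bet pays exactly what the
# pair of bets pays together, so coherent prices must agree as well.
q = 0.3  # an arbitrary illustrative value for P(A|B)

cases = ("A∩B", "Ac∩B", "Bc")
conditional_bet = {"A∩B": 1.0, "Ac∩B": 0.0, "Bc": q}
bet1 = {"A∩B": 1.0, "Ac∩B": 0.0, "Bc": 0.0}   # $1 if A∩B
bet2 = {"A∩B": 0.0, "Ac∩B": 0.0, "Bc": q}     # q back if B fails

for case in cases:
    assert conditional_bet[case] == bet1[case] + bet2[case]

print("Same payoffs case by case, so P(A|B) = P(A∩B) + P(A|B)P(Bc).")
```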
2.3 Some basic theorems involving conditional probability
With Kolmogorov’s axioms and ratio formula in place, we may go on to prove a number of theorems involving conditional probability. Especially important is the law of total probability, the simplest form of which is: P (A) = P (A|B)P (B) + P (A|B c )P (B c ). This follows immediately from the additivity formula P (A) = P (A ∩ B) + P (A ∩ B c ) by two uses of (RATIO). The law generalizes to the case in which we have a countable partition B1 , B2 , . . . : P (A) = P (A|B1 )P (B1 ) + P (A|B2 )P (B2 ) + . . . This tells us that the unconditional probability P (A) can be identified with a weighted average, or expectation, of probabilities conditional on each cell of a partition, the weights being the unconditional probabilities of the cells. We will see how the theorem underpins Kolmogorov’s more sophisticated formulation of conditional probability (§5), and a rule for updating probabilities (Jeffrey conditionalization, §7.2). In the meantime, notice how it yields versions of the equally celebrated Bayes’ theorem: P (A|B)
= =
P (B|A)P (A) P (B) P (B|A)P (A) P (B|A)P (A) + P (B|AC )P (AC )
(by two uses of (RATIO)) (by the law of total probability)
More generally, suppose there is a partition of hypotheses {H1 , H2 , ...}, and evidence E. Then for each i, P (E|Hi )P (Hi ) P (Hi |E) = P P (E|Hj )P (Hj ) j
The P (E|Hi ) terms are called likelihoods. Bayes’ theorem has achieved such a mythic status that an entire philosophical and statistical movement — “Bayesianism” — is named after it. This is meant
to honour the important role played by Bayes' theorem in calculating terms of the form 'P(Hᵢ|E)', and (at least among philosophers) Bayesianism is particularly associated with a subjectivist interpretation of 'P', and correspondingly with the thesis that rational degrees of belief are probabilities. This may seem somewhat curious, as Bayes' theorem is neutral vis-à-vis the interpretation of probability, being purely a theorem of the formal calculus, and just one of many theorems at that. In particular, it provides just one way to calculate a conditional probability, when various others are available, all ultimately deriving from (RATIO); and as we will see, often conditional probabilities can be ascertained directly, without any calculation at all. Moreover, a diachronic prescription for revising or updating probabilities, which we will later call 'conditionalization', is sometimes wrongly called 'updating by Bayes' theorem'. In fact the theorem is a 'static' rule relating probabilities synchronically, and being purely a piece of mathematics it cannot by itself have any normative force.
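For concreteness, here is a minimal sketch of the theorem at work over a finite partition of hypotheses; the priors and likelihoods are invented numbers, and nothing turns on the interpretation of 'P':

```python
# Bayes' theorem over a partition {H1, H2, H3}, with made-up numbers.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}          # P(Hi)
likelihoods = {"H1": 0.9, "H2": 0.4, "H3": 0.1}     # P(E|Hi)

# Law of total probability: P(E) = sum_i P(E|Hi) P(Hi)
p_E = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: likelihoods[h] * priors[h] / p_E for h in priors}
print(p_E)         # 0.59
print(posteriors)  # P(H1|E) = 0.45/0.59 ≈ 0.763, etc.
```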
2.4 Independence
Kolmogorov’s axioms assimilate probability theory to measure theory, the general theory of length, area, and volume. (Think of how these quantities are nonnegative, additive, and can often be normalized.) Conditional probability is a further, distinctively probabilistic notion without any obvious counterpart in measure theory. Similarly, independence is distinctively probabilistic, and ultimately parasitic on conditional probability. Let P (A) and P (B) both be positive. According to Kolmogorov’s theory, A and B are independent iff P (A|B) = P (A); equivalently, iff P (B|A) = P (B). These equations are supposed to capture the idea of A and B being uninformative regarding each other: that one event occurs in no way affects the probability that the other does. To be sure, there is a further equivalent characterization of independence that is free of conditional probability: P (A ∩ B) = P (A)P (B). But its rationale comes from the equalities of the corresponding conditional and unconditional probabilities. It has the putative advantage of applying even when A and/or B have probability 0; although it is questionable whether probability 0 events should automatically be independent of everything (including themselves, and their complements!). When we say that ‘A is independent of B’, we suppress the fact that such independence is really a three-place relation between an event, an event, and a probability function. This distinguishes probabilistic independence from such two-place relations as logical and counterfactual independence. Probabilistic independence is assumed in many of probability theory’s classic limit theorems.
We may go on to give a Kolmogorovian analysis of conditional independence. We say that A is independent of B, given C, if P (A ∩ B|C) = P (A|C)P (B|C), provided P (C) > 0.
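A small numerical sketch may make these definitions vivid. The joint distribution below is our own construction: A and B are correlated unconditionally, yet independent given C (and given Cᶜ):

```python
# Testing independence and conditional independence on a made-up joint.
from itertools import product

def p_joint(a, b, c):
    # C is a fair 'common cause'; A and B are independent given C.
    pA = 0.8 if (a == c) else 0.2   # P(A=a | C=c)
    pB = 0.8 if (b == c) else 0.2   # P(B=b | C=c)
    return 0.5 * pA * pB

def prob(pred):
    return sum(p_joint(a, b, c)
               for a, b, c in product([True, False], repeat=3)
               if pred(a, b, c))

pA, pB = prob(lambda a, b, c: a), prob(lambda a, b, c: b)
pAB = prob(lambda a, b, c: a and b)
print(pAB, pA * pB)          # 0.34 vs 0.25: A and B are NOT independent

pC = prob(lambda a, b, c: c)
pA_C = prob(lambda a, b, c: a and c) / pC
pB_C = prob(lambda a, b, c: b and c) / pC
pAB_C = prob(lambda a, b, c: a and b and c) / pC
print(pAB_C, pA_C * pB_C)    # 0.64 vs 0.64: independent given C
```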
2.5 Conditional expectation
Conditional probability underlies the concept of conditional expectation, also important in probability theory. A random variable X (on Ω) is a function from Ω to the set of real numbers, which takes the value X(ω) at each point ω ∈ Ω. If X is a random variable that takes the values x₁, x₂, . . . with probabilities p(x₁), p(x₂), . . . , then the expected value of X is defined as

E(X) = Σᵢ xᵢ p(xᵢ)

provided that the series converges absolutely. (For continuous random variables, we replace the p(xᵢ) by values given by a density function, and replace the sum by an integral.) A conditional expectation is an expectation of a random variable with respect to a conditional probability distribution. Let X and Y be two random variables with joint distribution

P(X = xⱼ ∩ Y = yₖ) = p(xⱼ, yₖ)   (j, k = 1, 2, . . .)

The conditional expectation E(Y|X = xⱼ) of Y given X = xⱼ is given by:

E(Y|X = xⱼ) = Σₖ yₖ P(Y = yₖ|X = xⱼ) = Σₖ yₖ p(xⱼ, yₖ) / p(xⱼ)
(Again, this has a continuous version.) We may generalize to conditional expectations involving more random variables. So the importance of conditional probability in probability theory is beyond dispute. The same can be said of its role in many philosophical applications.
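Here is a minimal discrete sketch of these definitions; the joint distribution p(x, y) is invented for illustration:

```python
# Conditional expectation E(Y | X = x) from a finite joint distribution.
joint = {  # p(x, y), illustrative numbers summing to 1
    (0, 1): 0.1, (0, 2): 0.3,
    (1, 1): 0.4, (1, 2): 0.2,
}

def cond_expectation(x):
    """E(Y | X = x) = sum_k y_k p(x, y_k) / p(x)."""
    p_x = sum(p for (xj, _), p in joint.items() if xj == x)
    return sum(y * p for (xj, y), p in joint.items() if xj == x) / p_x

print(cond_expectation(0))  # (1*0.1 + 2*0.3)/0.4 = 1.75
print(cond_expectation(1))  # (1*0.4 + 2*0.2)/0.6 ≈ 1.333
```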
3 PHILOSOPHICAL APPLICATIONS
Conditional probability is near-ubiquitous in both the methodology — in particular, the use of statistics and game theory — of the sciences and social sciences, and in their specific theories. Various central concepts in statistics are defined in terms of conditional probabilities: significance level, power, sufficient statistics, ancillarity, maximum likelihood estimation, Fisher information, and so on. Game theorists appeal to conditional probabilities for calculating the expected payoffs in correlated equilibrium; computing the Bayesian equilibrium in games of incomplete information; in certain Bayesian dynamic updating models of equilibrium selection in repeated games; and so on. Born’s rule in quantum mechanics is often
understood as a method of calculating the conditional probability of a particular measurement outcome, given that a measurement of a certain kind is performed on a system in a certain state. In medical and clinical psychological testing, conditional probabilities of the form P (disorder | positive test) (“diagnosticity”) and P (positive test | disorder) (“sensitivity”) take centre stage. Mendelian genetics allows us to compute probabilities for an organism having various traits, given information about the traits of its parents; and population genetics allows us to compute the chance of a trait going to fixation, given information about population size, initial allele frequencies, and the fitness gradient. The regression equations of economics, and many of the results in time series analysis, are claims about conditional probabilities. And so it goes — this little sampler could be extended almost indefinitely. Moreover, conditional probability is a staple of philosophy. The next section surveys a few of its philosophical applications.
3.1 Conditional probability in the philosophy of probability
A central issue in the philosophical foundations of probability is that of interpreting probability — that is, of analysing or explicating the ‘P ’ that appears in its formal theory. Conditional probability finds an important place in all of the leading interpretations: Frequentism: Probability is understood as relative frequency (perhaps in an infinite sequence of hypothetical trials) — e.g. the probability of heads for a coin is identified with the number of heads outcomes divided by the total number of trials in some suitable sequence of trials. Recalling our third justification for the ratio formula in §2.2, this seems to be naturally understood as a conditional probability, the condition being whatever determines the suitability of that sequence. Propensity: Probability is a measure of the tendency for a certain kind of experimental set-up to produce a particular outcome, either in the single case [Giere, 1973], or in the long run [Popper, 1959a]. Either way, it is a conditional probability, the condition being a specification of the experimental set up. Classical: Probability is assigned by one in an epistemically neutral position with respect to a set of “equally possible” cases — outcomes on which one’s evidence bears equally. Such an assignment must thus be relativised to such evidence. Logical: Probability is a measure of inductive support or partial entailment, generalizing both deductive logic’s notion of entailment and the classical interpretation’s assignments to “equally possible” cases. In Carnap’s notation, c(h, e) is a measure of the degree of support that evidence e confers on h. This is explicitly a conditional probability. Subjective: Probability is understood as the degree of belief of some agent (typically assumed to be ideally rational). As we have seen, some subjectivists (e.g. Jeffreys, de Finetti) explicitly regarded subjective conditional probability to be basic. But even subjectivists who regard unconditional probability as basic find an important place for conditional probability. Subjectivists are unified in regarding
conformity to the probability calculus as a rational requirement on credences. They often add further constraints, couched in terms of conditional probabilities; a number of examples follow.

Gaifman [1988] coins the term "expert probability" for a probability assignment that a given agent with subjective probability function P strives to track. We may codify this idea as follows (simplifying his characterization at the expense of some generality):

(Expert) P(A|pr(A) = x) = x, for all x such that P(pr(A) = x) > 0.

Here pr(A) is the assignment that the agent regards as expert. For example, if you regard the local weather forecaster as an expert, and she assigns probability 0.1 to it raining tomorrow, then you may well follow suit:

P(rain|pr(rain) = 0.1) = 0.1.

More generally, we might speak of an entire probability function as being such a guide for an agent, over a specified set of propositions — so that (Expert) holds for any choice of A from that set. A universal expert function would guide all of the agent's probability assignments in this way. van Fraassen [1984; 1995], following Goldstein [1983], argues that an agent's future probability functions are universal expert functions for that agent — his Reflection Principle:

Pₜ(A|Pₜ′(A) = x) = x, for all A and for all x such that Pₜ(Pₜ′(A) = x) > 0,

where Pₜ is the agent's probability function at time t, and Pₜ′ her function at later time t′. The principle encapsulates a certain demand for 'diachronic coherence' imposed by rationality. van Fraassen defends it with a 'diachronic' Dutch Book argument (one that considers bets placed at different times), and by analogizing violations of it to the sort of pragmatic inconsistency that one finds in Moore's paradox.

We may go still further. There may be universal expert functions for all rational agents. The Principle of Direct Probability regards the relative frequency function as a universal expert function (cf. [Hacking, 1965]). Let A be an event-type, and let relfreq(A) be the relative frequency of A (in some suitable reference class). Then for any rational agent, we have

P(A|relfreq(A) = x) = x, for all A and for all x such that P(relfreq(A) = x) > 0.

Related, but distinct according to those who do not identify objective chances with relative frequencies, is Lewis's [1980]:

(Principal Principle) P(A|chₜ(A) = x & E) = x

Here 'P' is an 'initial' rational credence function (the prior probability function of a rational agent who has acquired no information), A is a proposition, chₜ(A) is the chance of A at time t, and E is further evidence that may be acquired. In
order for the Principal Principle to be applicable, E cannot be relevant to whether A is true or false, other than by bearing on the chance of A at t; E is then said to be admissible (strictly speaking: with respect to P, A, t, and x). The literature ritually misstates the Principal Principle, regarding ‘P ’ as the credence function of a rational agent quite generally, rather than an ‘initial’ credence function as Lewis explicitly formulated it. Misstated this way, it is open to easy counterexamples in which the agent has information bearing on A that has been incorporated into ‘P ’, although not explicitly written into the condition (in the slot that ‘E’ occupies). Interestingly, admissibility is surely just as much an issue for the other expert principles, yet for some reason hardly discussed outside the Principal Principle literature, where it is all the rage. Finally, some authors impose the requirement of strict coherence on rational agents: such an agent assigns P (H|E) = 1 only if E entails H. See Shimony [1955].
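A small sketch may make the structure of these deference principles vivid. The distribution over the expert's possible announcements below is invented, and the joint is constructed by us precisely so that (Expert) holds:

```python
# Deference per (Expert): P(A | pr(A) = x) = x, for a three-valued expert.
announcements = {0.1: 0.3, 0.5: 0.5, 0.9: 0.2}  # P(pr(A) = x), made up

# Joint over (announcement, A), built so that (Expert) holds exactly.
joint = {}
for x, w in announcements.items():
    joint[(x, True)] = w * x
    joint[(x, False)] = w * (1 - x)

for x, w in announcements.items():
    cond = joint[(x, True)] / (joint[(x, True)] + joint[(x, False)])
    assert abs(cond - x) < 1e-12   # P(A | pr(A) = x) = x

# By the law of total probability, P(A) is your expectation of the
# expert's announced value.
print(sum(joint[(x, True)] for x in announcements))  # 0.46
```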
3.2 Some uses of conditional probability in other parts of philosophy
The use of conditional probability in updating rules for credences, and in the semantics of conditionals, has been so important and fertile that I will devote entire sections to them later on (§7 and §9). In the meantime, here are just a few of the myriad applications of conditional probability in various other areas of philosophy.

Probabilistic causation

A major recent industry in philosophy has been that of providing analyses of causation compatible with indeterminism. At a first pass, we might analyze 'causation' as 'correlation' — that is, analyze 'A causes B' as

P(B|A) > P(B|Aᶜ).

This analysis cannot be right. It wrongly classifies spurious correlations and effects of common causes as instances of causation; moreover, it fails to capture the asymmetry of the causal relation. So a number of authors refine the analysis along the following lines (e.g. [Suppes, 1970; Cartwright, 1979; Salmon, 1980; Eells, 1991]):

A causes B iff P(B|A ∩ X) > P(B|Aᶜ ∩ X) for every member X of some 'suitable' partition.

The exact details vary from author to author; what they share is the fundamental appeal to inequalities among conditional probabilities. Reichenbach's [1956] famous common cause principle is again couched in terms of inequalities among conditional probabilities. The principle asserts that if A and B are simultaneous events that are correlated, then there exists an earlier common
cause C of A and B, such that:

P(A|C) > P(A|Cᶜ), P(B|C) > P(B|Cᶜ),
P(A ∩ B|C) = P(A|C)P(B|C), and P(A ∩ B|Cᶜ) = P(A|Cᶜ)P(B|Cᶜ).

That is, C is correlated with A and with B, and C screens off A from B (they are independent conditional on C).

Bayesian networks

We may model a causal network as a directed acyclic graph with nodes corresponding to variables. If one variable directly causes another, we join the corresponding nodes with a directed edge, its arrow pointing towards the 'effect' variable. We may naturally employ a genealogical nomenclature: we call the cause the 'parent' variable, the effect a 'child' variable, and call iterations of these relationships 'ancestors' and 'descendants' in the obvious way. In a Bayesian network, a probability distribution is assigned across the nodes. The Causal Markov condition is a commonly held assumption about conditional independence relationships. Roughly, it states that any node in a given network is conditionally independent of its non-descendants, given its parents. More formally (with obvious notation): "Let G be a causal graph with vertex set V and P be a probability distribution over the vertices in V generated by the causal structure represented by G. G and P satisfy the Causal Markov Condition if and only if for every W in V, W is independent of V\(Descendants(W) ∪ Parents(W)) given Parents(W)" [Spirtes et al., 2000, 29]. ("\" denotes set subtraction.) Faithfulness is the converse condition that the set of independence relations derived from the Causal Markov Condition is exactly the set of independence relations that hold for the network. (See [Spirtes et al., 2000; Hausman and Woodward, 1999].)

Inductive-statistical explanation

Hempel [1965] regards scientific explanation as a matter of subsuming an explanandum E under a law L, so that E can be derived from L in conjunction with particular facts. He also recognizes a distinctive kind of "inductive-statistical" (IS) explanation, in which E is subsumed under a statistical law, which will take the form of a statement of conditional probability; in this case, E cannot be validly derived from the law and particular facts, but rather is rendered probable in accordance with the conditional probability.

Confirmation

While 'correlation is causation' is an unpromising slogan, 'correlation is confirmation' has fared much better. Confirmation is a useful concept, because even
if Hume was right that there are no necessary connections between distinct existences, still it seems there are at least some non-trivial probabilistic relations between them. That's just what we mean by saying things like 'B supports A', or 'B is evidence for A', or 'B is counterevidence for A', or 'B disconfirms A'. So, many Bayesians appropriate the unsuccessful first attempt above to analyze causation, and turn it into a far more successful attempt to analyze confirmation — confirmation is positive correlation, disconfirmation is negative correlation, and evidential irrelevance is independence. Relative to probability function P,

• E confirms H iff P(H|E) > P(H)
• E disconfirms H iff P(H|E) < P(H)
• E is evidentially irrelevant to H iff P(H|E) = P(H)

Curve-fitting, and the Akaike Information Criterion

Scientists are familiar with the problem of fitting a curve to a set of data. Forster and Sober [1994] argue that the real problem is one of trading off verisimilitude and simplicity: for a given set of data points, finding the curve that best balances the desiderata of predicting the points as accurately as possible using a function that has as few parameters as possible, so as not to 'overfit' the points. They argue that simplicity should be attributed to families of curves rather than to individual curves. They advocate selecting the family F with the best expected 'predictive accuracy', as measured by the Akaike Information Criterion:

AIC(F) = (1/N)[log P(Data|L(F)) − k],
where L(F) is the member of F that fits the data best, and k is the number of adjustable parameters of members of F. Various other approaches to the curve-fitting problem similarly appeal to likelihoods (at least tacitly), and thus to conditional probabilities. They include the Bayesian Information Criterion [BIC; Schwarz, 1978], Minimum Message Length inference [MML; Wallace and Dowe, 1999; Dowe et al., 2007] and Minimum Description Length inference [MDL; Grunwald et al., 2005].

Decision theory

Decision theory purports to tell us how an agent's beliefs and desires in tandem determine what she should do. It combines her utility function and her probability function to give a figure of merit for each possible action, called the expectation, or desirability, of that action (rather like the formula for the expectation of a random variable): a weighted average of the utilities associated with each action. In so-called 'evidential decision theory', as presented by Jeffrey [1983], the weights are
conditional probabilities for states, given actions. Let S₁, S₂, . . . , Sₙ be a partition of possible states of the world. The choice-worthiness of action A is given by:

V(A) = Σᵢ u(A & Sᵢ) P(Sᵢ|A)
And so it goes again — this has just been another sampler. Given the obvious importance of conditional probability in philosophy, it will be worth investigating how secure are its foundations in (RATIO).
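To fix ideas before moving on, here is a minimal sketch of Jeffrey's formula; the acts, states, utilities, and conditional probabilities are all invented for illustration:

```python
# Evidential decision theory: V(A) = sum_i u(A & Si) P(Si | A).
states = ("S1", "S2")

P_state_given_act = {  # P(Si | act), made-up numbers
    ("act1", "S1"): 0.7, ("act1", "S2"): 0.3,
    ("act2", "S1"): 0.4, ("act2", "S2"): 0.6,
}
utility = {  # u(act & Si), made-up numbers
    ("act1", "S1"): 10, ("act1", "S2"): -5,
    ("act2", "S1"): 8,  ("act2", "S2"): 2,
}

def V(act):
    return sum(utility[(act, s)] * P_state_given_act[(act, s)] for s in states)

print(V("act1"))  # 10*0.7 + (-5)*0.3 = 5.5
print(V("act2"))  # 8*0.4 + 2*0.6 = 4.4  -> act1 is more choice-worthy
```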
4 PROBLEMS WITH THE RATIO ANALYSIS OF CONDITIONAL PROBABILITY
So far we have looked at success stories for the usual understanding of conditional probability, given by (RATIO). We have seen that several different kinds of argument triangulate to it, and that it subserves a vast variety of applications of conditional probability — indeed, this latter fact itself provides a further pragmatic argument in favour of it. But we have not yet reached the end of the story. I turn now to four different kinds of problem for the ratio analysis, each mitigating the arguments in its favour.
4.1 Conditions with probability zero
P(A ∩ B)/P(B) is undefined when P(B) = 0; the ratio formula comes with the proviso that P(B) > 0. The proviso would be of little consequence if we could be assured that all probability-zero events of any interest are impossible, but as probability textbooks and even Kolmogorov himself caution us, this is not so. That is, we could arguably dismiss probability zero antecedents as 'don't cares' if we could be assured that all probability functions of any interest are regular — that is, they assign probability 0 only to the empty set. But this is not so. Worse, there are many cases of such conditional probabilities in which intuition delivers a clear verdict as to the correct answer, but (RATIO) delivers no verdict at all.

Firstly, in uncountable probability spaces one cannot avoid probability zero events that are possible — indeed, we are saddled with uncountably many of them.² Consider probability spaces with points taken from a continuum. Here is an example originating with Borel: A point is chosen at random from the surface of the earth (thought of as a perfect sphere); what is the probability that it lies in the Western hemisphere, given that it lies on the equator? 1/2, surely. Yet the probability of the condition is 0, since a uniform probability measure over a sphere must award probabilities to regions in proportion to their area, and the equator has area 0.

²Here I continue to assume Kolmogorov's axiomatization, according to which probabilities are real-valued. Regularity may be achieved by the use of infinitesimal probabilities (see e.g. [Skyrms, 1980]); but see [Hájek, 2003] for some concerns.
The ratio analysis thus cannot deliver the intuitively correct answer. Obviously there are uncountably many problem cases of this form for the sphere.

Another class of problem cases arises from the fact that the power set of a denumerable set is uncountable. For example, the set of all possible infinite sequences of tosses of a coin is uncountable (the sets of positive integers that could index the heads outcomes form the power set of the positive integers). Any particular sequence has probability zero (assuming that the trials are independent and identically distributed with intermediate probability for heads). Yet surely various corresponding conditional probabilities are defined — e.g., the probability that a fair coin lands heads on every toss, given that it lands heads on tosses 3, 4, 5, . . . , is 1/4. More generally, the various classic 'almost sure' convergence results — the strong law of large numbers, the law of the iterated logarithm, the martingale convergence theorem, etc. — assert that certain convergences take place, not with certainty, but 'almost surely'. This is not merely coyness, since these convergences may fail to take place — again, genuine possibilities that receive probability 0, and interesting ones at that. The fair coin may land heads on every toss, and it would be no less fair for it.

Zero probability events also arise naturally in countable probability spaces if we impose certain symmetry constraints (such as the principle of indifference), and if we are prepared to settle for finite additivity. Following de Finetti [1990], imagine an infinite lottery whose outcomes are equiprobable. Each ticket has probability 0 of winning, although with probability 1 some ticket will win. Again, various conditional probabilities seem to be well-defined: for example, the probability that ticket 1 wins, given that either ticket 1, 2, or 3 wins, is surely 1/3.

The problem of zero-probability conditions is not simply an artefact of the mathematics of infinite (and arguably idealized) probability spaces. For even in finite probability spaces, various possible events may receive probability zero. This is most obvious for subjective probabilities, and in fact it happens as soon as an agent updates on some non-trivial information, thus ruling out the complement of that information — e.g., when you learn that the die landed with an odd number showing up, thus ruling out that it landed with an even number showing up. But still it seems that various conditional probabilities with probability-zero conditions can be well-defined — e.g., the probability that the die landed 2, given that it landed 2, is 1. Indeed, it seems that there are some contingent propositions that one is rationally required to assign probability 0 — e.g., 'I do not exist'. But various associated conditional probabilities may be well-defined nonetheless — e.g. the probability that I do not exist, given that I do not exist, is 1. Perhaps such cases are more controversial. If so, it matters little. As long as there is some case of a well-defined conditional probability with a probability-zero condition, then (RATIO) is refuted as an analysis of conditional probability. (It may nonetheless serve many of our purposes well enough as a partial — that is, incomplete — analysis. See [Hájek, 2003] for further discussion.)
4.2 Conditions with unsharp probability
A number of philosophers and statisticians eschew the usual assumption that probabilities are always real numbers, sharp to infinitely many decimal places. Instead, probabilities may for example be intervals, or convex sets, or sets of real numbers more generally. Such probabilities are given various names: "indeterminate" [Levi, 1974; 2000], "vague" [van Fraassen, 1990], "imprecise" [Walley, 1991], although these words have other philosophical associations that may not be intended here. Maybe it is best to mint a new word for this purpose. I will call them unsharp, marking the contrast to the usual sharp probabilities, while remaining neutral as to how unsharp probabilities should be modelled.

What is the probability that the Democrats win the next U.S. election? Plausibly, the answer is unsharp. This is perhaps clearest if the probability is subjective. If you say, for example, that your credence that they win is 0.6, it is doubtful that you really mean 0.60000 . . . , precise to infinitely many decimal places. Now, what is the probability that the Democrats win the next U.S. election, given that they win the next U.S. election? Here the answer is sharp: 1. Or what is the probability that this fair coin will land Heads when tossed, given the Democrats win the next U.S. election? Again, the answer seems to be sharp: 1/2. In Hájek [2003] I argue that cases like these pose a challenge to the ratio analysis: it seems unable to yield such results. To be sure, perhaps that analysis coupled with suitable devices for handling unsharpness — e.g. supervaluation — can yield the results (although I argue that they risk being circular). Still, the point remains that the ratio analysis cannot be the complete story about conditional probability.
4.3 Conditions with vague probability
A superficially similar, but subtly different kind of case involves conditions with what I will call vague probability. Let us first be clear on the distinction between unsharpness and vagueness in general, before looking at probabilistic cases (this is a reason why I did not adopt the word “vagueness” in the previous section). The hallmark of vagueness is often thought to be the existence of borderline cases. A predicate is vague, we may say, if there are possible individuals that do not clearly belong either to the extension or the anti-extension of the predicate. For example, the predicate “fortyish” is vague, conveying a fuzzy region centered around 40 for which there are borderline cases (e.g. a person who is 43). By contrast, I will think of an unsharp predicate as admitting of a range of possible cases, but not borderline cases. “Forty-something” is unsharp: it covers the range of ages in the interval [40, 50), but any particular person either clearly falls under the predicate or clearly does not. However we characterize the distinction, the phenomena of vagueness and unsharpness appear to be different. I now turn to the problem that vague probability causes for the ratio analysis. Suppose that we run a million-ticket lottery. What is the probability that a large-numbered ticket wins? It is vague what counts as a ‘large number’ — 17 surely doesn’t, 999,996 surely does, but there are many numbers that are not so
easily classified. The probability assignment plausibly inherits this vagueness — it might be, for example, ‘0.3-ish’, again with borderline cases. Now, what is the probability that a large-numbered ticket wins, given that a large-numbered ticket wins? That is surely razor-sharp: 1. As before, the challenge to the ratio analysis is to do justice to these facts.
4.4 Conditions with undefined probability
Finally, we come to what I regard as the most important class of problem cases for (RATIO), for they are so widespread and often mundane. They arise when neither P(A ∩ B) nor P(B) is defined, and yet the probability of A, given B, is defined. Here are two kinds of case, the first more intuitive, the second more mathematically rigorous, both taken from [Hájek, 2003].

The first involves a coin that you believe to be fair. What is the probability that it lands heads, given that I toss it fairly? 1/2, of course. According to the ratio analysis, it is P(the coin lands heads | I toss the coin fairly), that is,

P(the coin lands heads ∩ I toss the coin fairly) / P(I toss the coin fairly).

However, these unconditional probabilities may not be defined — e.g. you may simply not assign them values. After some thought, you may start to assign them values, but the damage has already been done; and then again, you may still not do so. In [Hájek, 2003] I argue that this ratio may well remain undefined, and I rebut various proposals for how it may be defined after all.

The second kind of case involves non-measurable sets. Imagine choosing a point at random from the [0, 1] interval. We would like to model this with a uniform probability distribution, one that assigns the same probability to a given set as it does to any translation (modulo 1) of that set. Assuming the axiom of choice and countable additivity, it can be shown that for any such distribution P there must be sets that receive no probability assignment at all from P — so-called 'non-measurable sets'. Let N be such a set. Then P(N) is undefined. Nonetheless, it is plausible that the probability that the chosen point comes from N, given that it comes from N, is 1; the probability that it does not come from N, given that it comes from N, is 0; and so on. The ratio analysis cannot deliver these results.

The coin toss case may strike you as contentious, and the non-measurable case as pathological (although in [Hájek, 2003] I defend them against these charges). But notice that many of the paradigmatic applications of conditional probability canvassed in the previous section would seem to go the same way. For example, the Born rule surely should not be understood as assigning a value to a ratio of unconditional probabilities of the form

P(measurement outcome Oₖ is observed ∩ measurement M is performed) / P(measurement M is performed).
Among other things, the terms in the ratio are clearly not given by quantum mechanics, and may plausibly not be defined at all, involving as they do a tacit quantification over the free actions of an open-ended set of experimenters. To summarize: we have seen four kinds of case in which the ratio analysis appears to run aground: conditional probabilities with conditions whose probabilities are either zero, unsharp, vague, or undefined. Now there is a good sense in which these are problems with unconditional probability in its own right, which I am parlaying into problems for conditional probability. For example, the fact that Kolmogorov’s theory of unconditional probability conflates zero-probability possibilities with genuine impossibilities may seem to be a defect of that theory, quite apart from its consequences for conditional probability. Still, since his theory of conditional probability is parasitic on his theory of unconditional probability, it should come as no surprise that defects in the latter can be exploited to reveal defects in the former. And notice how the problems in unconditional probability theory can be amplified when they become problems in conditional probability theory. For example, the conflation of zero-probability possibilities with genuine impossibilities might be thought of as a minor ‘blurriness in vision’ of probability theory; but it is rather more serious when it turns into problems of outright undefinedness in conditional probability, total blind spots. Here are two ways that one might respond. First, one might preserve the conceptual priority that Kolmogorov gives to unconditional over conditional probability, but seek a more sophisticated account of conditional probability. Second, one might reverse the conceptual order, and regard conditional probability as the proper primitive of probability theory. The next two sections discuss versions of these responses, respectively.
5 KOLMOGOROV'S REFINEMENT: CONDITIONAL PROBABILITY AS A RANDOM VARIABLE
(This section is more advanced, and may be skipped by readers who are more interested in philosophical issues. Its exposition largely follows [Billingsley, 1995]; the ensuing critical discussion is my own.) Kolmogorov went on to give a more sophisticated account of conditional probability as a random variable.
5.1 Exposition
Let the probability space ⟨Ω, F, P⟩ be given. We will interpret P as the credence function of an agent, which assumes the value P(ω) at each point ω ∈ Ω. Fixing A ∈ F, we may define the random variable whose value is:

P(A|B) if ω ∈ B,
P(A|Bᶜ) if ω ∈ Bᶜ.
Think of our agent as about to learn the result of the experiment regarding B, and she will update accordingly. (§7 discusses updating rules in greater detail.) Now generalize from the 2-celled partition {B, Bᶜ} to any countable partition {B₁, B₂, . . .} of Ω into F-sets. Let G consist of all of the unions of the Bᵢ; it is the smallest sigma field that contains all of the Bᵢ. G can be thought of as an experiment. Our agent will learn which of the Bᵢ obtains — that is, the outcome of the experiment — and is poised to update her beliefs accordingly. Fixing A ∈ F, consider the function whose values are:

P(A|B₁) if ω ∈ B₁,
P(A|B₂) if ω ∈ B₂,
. . .

when these quantities are defined. If P(Bᵢ) = 0, let the corresponding value of the function be chosen arbitrarily from [0, 1], this value constant for all ω ∈ Bᵢ. Call this function the conditional probability of A given G, and denote it P[A‖G]. Given the latitude in assigning a value to this function if P(Bᵢ) = 0, P[A‖G] stands for any one of a family of functions on Ω, differing on how this arbitrary choice is made. A specific such function is called a version of the conditional probability. Thus, any two versions agree except on a set of probability 0. Any version codifies all of the agent's updating dispositions in response to all of the possible results of the experiment. Notice that since any G ∈ G is a disjoint union ∪ₖ Bᵢₖ, the probability of any set of the form A ∩ G can be calculated by the law of total probability:

(1) P(A ∩ G) = Σₖ P(A|Bᵢₖ)P(Bᵢₖ)

We may generalize further to the case where the sigma field G may not necessarily come from a countable partition, as was previously the case. Our agent will learn for each G in G whether ω ∈ G or ω ∈ Gᶜ. Generalizing (1), we would like to be assured of the existence of a function P[A‖G] that satisfies the equation:

P(A ∩ G) = ∫_G P[A‖G] dP   for all G ∈ G.

That assurance is provided by the Radon-Nikodym theorem, which for probability measures ν and P defined on F states: If P(X) = 0 implies ν(X) = 0, then there exists a function f such that

ν(A) = ∫_A f dP

for all A ∈ F. Let ν(G) = P(A ∩ G) for all G ∈ G. Notice that P(G) = 0 implies ν(G) = 0, so the Radon-Nikodym theorem applies: the function P[A‖G] that we sought does indeed exist.
As before, there may be many such functions, differing on their assignments to probability-zero sets; any such function is called a version of the conditional probability. Stepping back for a moment: ∫_G P[A‖G] dP is the expectation of the random variable P[A‖G], conditional on G, weighted according to the measure P. We have come back full circle to the remark made earlier about the law of total probability: an unconditional probability can be identified with an expectation of probabilities conditional on each cell of a partition, weighted according to the unconditional probabilities of the cells.
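For the countable-partition case, here is a minimal sketch (ours) of P[A‖G] as a function of ω, constant on each cell; the space, event, and partition are illustrative choices, and the arbitrary value on null cells marks exactly where distinct 'versions' may disagree:

```python
# P[A||G] as a random variable for a finite space and partition.
omega = range(1, 7)                    # a die roll
P = {w: 1/6 for w in omega}
A = {2, 4, 6}                          # the event 'even'
partition = [{1, 2}, {3, 4}, {5, 6}]   # the experiment G

def prob(event):
    return sum(P[w] for w in event)

def cond_prob_given_G(w):
    """The value of (a version of) P[A||G] at the point ω."""
    cell = next(B for B in partition if w in B)
    if prob(cell) == 0:
        return 0.5  # arbitrary: another choice gives another 'version'
    return prob(A & cell) / prob(cell)

print({w: cond_prob_given_G(w) for w in omega})  # 1/2 on every cell here

# Equation (1): P(A ∩ G) equals the total-probability sum over the cells
# composing G, for any G in the field generated by the partition.
G = {1, 2, 3, 4}
lhs = prob(A & G)
rhs = sum(prob(A & B) for B in partition if B <= G)  # = sum P(A|B)P(B)
assert abs(lhs - rhs) < 1e-12
```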
5.2 Critical discussion
Kolmogorov’s more sophisticated formulation of conditional probability provides some relief from the problem of conditions with probability zero — there is no longer any obstacle to such conditional probabilities being defined. However, the other three problems for the ratio analysis — conditions with unsharp, vague, or undefined probability — would appear to remain. For the more sophisticated formulation equates a certain integral, in which the relevant conditional probability figures, to the probability of a conjunction; but when this latter probability is either unsharp, vague, or undefined, the analysis goes silent. Moreover, there is further trouble that had no analogue for the ratio analysis, as shown by Seidenfeld, Schervish, and Kadane in their [2001] paper on “regular conditional distributions” — i.e. distributions of the form P [ kA] that we have been discussing. Let P [ kA](ω) denote the regular conditional distribution for the probability space (Ω, B, P ) given the conditioning sub-σ-field A, evaluated at the point ω. Following Blackwell and Dubins [1975], say that a regular conditional distribution is proper at ω if it is the case that whenever ω ∈ A ∈ A, P (AkA)(ω) = 1 The distribution is improper if it is not everywhere proper. Impropriety seems to be disastrous. We may hold this truth to be self-evident: the conditional probability of anything consistent, given itself, should be 1. Indeed, it seems to be about as fundamental fact about conditional probability as there could be, on a par with the fundamental fact in logic that any proposition implies itself. So the possibility of impropriety, however minimal and however localized it might be, is a serious defect in an account of conditional probability. But Seidenfeld et al. show just how striking the problem is. They give examples of regular conditional distributions that are maximally improper. They are cases in which P [AkA](ω) = 0 (as far from the desired value of 1 as can be), and this impropriety holds almost everywhere according to P , so the impropriety is maximal both locally and glob-
This is surely bad news for the more sophisticated analysis of conditional probability — arguably fatal.

³A necessary condition for this is that the conditioning sub-sigma algebra is not countably generated.
6 CONDITIONAL PROBABILITY AS PRIMITIVE
A rival approach takes conditional probability P( , ) as primitive. If we like, we may then define the unconditional probability of a as P(a, T), where T is a logical truth. (We use lower case letters and a comma separating them in keeping with Popper's formulation, which we will soon be presenting.) Various axiomatizations of primitive conditional probability have been defended in the literature. See Roeper and Leblanc [1999] for an encyclopedic discussion of competing theories of conditional probability, and Keynes [1921], Carnap [1950], Popper [1959b], and Hájek [2003] for arguments that probability is inherently a two-place function. As is so often the case, their work was foreshadowed by Jeffreys [1939/1961], who axiomatized a comparative conditional probability relation: p is more probable than q, given r. In some ways, the most general of the proposed axiomatizations is Popper's [1959b], and his system is the one most familiar to philosophers. Renyi's [1970] axiomatization is undeservedly neglected by philosophers. It closely mimics Kolmogorov's axiomatization, replacing unconditional with conditional probabilities in natural ways. I regard it as rather more intuitive than Popper's system. But since the latter has the philosophical limelight, I will concentrate on it here.

Popper's primitives are: (i) Ω, the universal set; (ii) a binary numerical function p( , ) of the elements of Ω; (iii) a binary operation ab defined for each pair (a, b) of elements of Ω; (iv) a unary operation ¬a defined for each element a of Ω. Each of these concepts is introduced by a postulate (although the first actually plays no role in his theory):

Postulate 1. The number of elements in Ω is countable.

Postulate 2. If a and b are in Ω, then p(a, b) is a real number, and the following axioms hold:

A1. (Existence) There are elements c and d in Ω such that p(a, b) ≠ p(c, d).

A2. (Substitutivity) If p(a, c) = p(b, c) for every c in Ω, then p(d, a) = p(d, b) for every d in Ω.

A3. (Reflexivity) p(a, a) = p(b, b).
Postulate 3. If a and b are in Ω, then ab is in Ω; and if c is also in Ω, then the following axioms hold:

B1. (Monotony) p(ab, c) ≤ p(a, c)

B2. (Multiplication) p(ab, c) = p(a, bc)p(b, c)
Postulate 4. If a is in Ω, then ¬a is in Ω; and if b is also in Ω, then the following axiom holds:

C. (Complementation) p(a, b) + p(¬a, b) = p(b, b), unless p(b, b) = p(c, b) for every c in Ω.
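Before stating Popper's fifth postulate, here is a small sketch — our construction, not Popper's — that induces a two-place p(a, b) from an ordinary unconditional P (the ratio where defined, 1 otherwise) and checks axioms B1, B2, and C on a finite space of propositions:

```python
# Inducing a Popper-style two-place function and checking B1, B2, C.
from itertools import product

space = (1, 2, 3, 4)
P = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # made-up point weights
props = [frozenset(s) for s in ([], [1], [2], [1, 2], [3, 4], [1, 2, 3, 4])]

def prob(a):
    return sum(P[w] for w in a)

def p(a, b):
    """p(a, b): the ratio where P(b) > 0, and 1 on 'impossible' conditions."""
    return prob(a & b) / prob(b) if prob(b) > 0 else 1.0

def neg(a):
    return frozenset(space) - a

def close(x, y):
    return abs(x - y) < 1e-12

for a, b, c in product(props, repeat=3):
    assert p(a & b, c) <= p(a, c) + 1e-12                 # B1 (Monotony)
    assert close(p(a & b, c), p(a, b & c) * p(b, c))      # B2 (Multiplication)
    if not all(close(p(b, b), p(x, b)) for x in props):   # C's 'unless' clause
        assert close(p(a, b) + p(neg(a), b), p(b, b))

print("B1, B2 and C hold for this induced function.")
```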
Popper also adds a "fifth postulate", which may be thought of as giving the definition of absolute (unconditional) probability:

Postulate AP. If a and b are in Ω, and if p(b, c) ≥ p(c, b) for every c in Ω, then p(a) = p(a, b).

Popper's axiomatization thus generalizes ordinary probability theory. Intuitively, b can be regarded as a logical truth. Unconditional probability, then, can be regarded as probability conditional on a logical truth. However, a striking fact about the axiomatization is that it is autonomous — it does not presuppose any set-theoretic or logical notions (such as "logical truth"). A function p( , ) that satisfies the above axioms is called a Popper function. A well-known advantage of the Popper function approach is that it allows conditional probabilities of the form p(a, b) to be defined, and to have intuitively correct values, even when the 'condition' b has absolute probability 0, thus rendering the usual conditional probability ratio formula inapplicable — we saw examples in §4.1. Moreover, Popper functions can bypass our concerns about conditions with unsharp, vague, or undefined probabilities — the conditional probabilities at issue are assigned directly, without any detour or constraint given by unconditional probabilities.

McGee [1994] shows that in an important sense, probability statements cast in terms of Popper functions and those cast in terms of nonstandard probability functions are inter-translatable. If r is a nonstandard real number, let st(r) denote the standard part of r, that is, the unique real number that is infinitesimally close to r. McGee proves the following theorem: If P is a nonstandard-valued probability assignment on a language L for the classical sentential calculus, then the function C : L × L → R given by

C(a, b) = st(P(ab)/P(b)), provided P(b) > 0
C(a, b) = 1, otherwise
is a Popper function. Conversely, if C is a Popper function, there is a nonstandardvalued probability assignment P such that P (b) = 1 iff C( , b) is the constant function 1
120
Alan H´ ajek
and C(c, b) = st(
P (cb) ) whenever P (b) > 0. P (b)
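Since Popper functions generalize ordinary probability, any classical measure, completed by the convention p(a, b) = 1 when P(b) = 0, should satisfy the postulates. The following sketch, my own illustration rather than anything in the text, brute-force checks axioms A1-A3, B1, B2 and C on a small, hypothetical three-outcome algebra:

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical finite example: three outcomes with unequal weights.
OMEGA = frozenset({0, 1, 2})
WEIGHT = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

EVENTS = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(OMEGA, r)]

def P(a):
    return sum((WEIGHT[w] for w in a), Fraction(0))

def p(a, b):
    # Popper function induced by P: the ratio where defined, 1 otherwise.
    return P(a & b) / P(b) if P(b) > 0 else Fraction(1)

# A1 (Existence): the function is not constant.
assert len({p(a, b) for a in EVENTS for b in EVENTS}) > 1

for a in EVENTS:
    for b in EVENTS:
        # A2 (Substitutivity) and A3 (Reflexivity).
        if all(p(a, c) == p(b, c) for c in EVENTS):
            assert all(p(d, a) == p(d, b) for d in EVENTS)
        assert p(a, a) == p(b, b)
        # B1 (Monotony) and B2 (Multiplication).
        for c in EVENTS:
            assert p(a & b, c) <= p(a, c)
            assert p(a & b, c) == p(a, b & c) * p(b, c)
        # C (Complementation), skipping 'abnormal' conditions b.
        if not all(p(b, b) == p(c, b) for c in EVENTS):
            assert p(a, b) + p(OMEGA - a, b) == p(b, b)

print("All of Popper's axioms hold on this finite example.")
```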
The arguments adduced in §4 against the ratio analysis of conditional probability indirectly support taking conditional probability as primitive, although they also leave open the viability of some other analysis of conditional probability in terms of unconditional probability. However, there are some considerations that seem to favour the primacy of conditional probability. The conditional probability assignments that I gave in §4’s examples are seemingly non-negotiable. They can, and in some cases must, stand without support from corresponding unconditional probabilities. Moreover, the examples of unsharp, vague, and undefined probabilities suggest that the problem with the ratio analysis is not so much that it is a ratio analysis, but that it is an analysis at all: the problem lies in the very attempt to analyze conditional probabilities in terms of unconditional probabilities. It seems that any other putative analysis that treated unconditional probability as more basic than conditional probability would meet a similar fate — as Kolmogorov’s elaboration did. On the other hand, given an unconditional probability, there is always a corresponding conditional probability lurking in the background. Your assignment of 1/2 to the coin landing heads superficially seems unconditional; but it can be regarded as conditional on tacit assumptions about the coin, the toss, the immediate environment, and so on. In fact, it can be regarded as conditional on your total evidence — recall the quotation from de Finetti in the second paragraph of this article. Now, perhaps in very special cases we can assign a probability free of all assumptions — an assignment of 1 to ‘I exist’ may be such a case. But even then, the probability is easily recovered as probability conditional on a logical truth or some other a priori truth. Furthermore, we can be sure that there can be no analogue of the argument that conditional probabilities can be defined even when the corresponding unconditional probabilities are not, that runs the other way. For whenever an unconditional probability P (X) is defined, it trivially equals the conditional probability of X given a logical/a priori truth. Unconditional probabilities are special cases of conditional probabilities. These considerations are supported further by our discussion in §3.1 of how, according to the leading interpretations, probability statements are always at least tacitly relativised — on the frequency interpretations, to a reference class; on the propensity interpretation, to a chance set-up; on the classical and logical interpretations, to a body of evidence; on the subjective interpretation, to a subject (who has certain background knowledge) at a time, and who may defer to some ‘expert’ (a person, a future self, a relative frequency, a chance). Putting these facts together, we have a case for regarding conditional probability as conceptually prior to unconditional probability. So I suggest that we reverse the traditional direction of analysis: regard conditional probability as the primitive notion, and unconditional probability as the derivative notion. But I also recommend Kenny Easwaran’s contribution to this volume (“The Varieties of Conditional Probability”) for a different perspective.
7 CONDITIONAL PROBABILITIES AND UPDATING RULES
7.1 Conditionalization
Suppose that your degrees of belief are initially represented by a probability function Pinitial ( ), and that you become certain of E (where E is the strongest such proposition). What should be your new probability function Pnew ? The favoured updating rule among Bayesians is conditionalization; Pnew is related to Pinitial as follows: (Conditionalization) Pnew (X) = Pinitial (X|E) (provided Pinitial (E) > 0) Conditionalization is supported by some arguments similar to those that supported the ratio analysis. Firstly, there is case-by-case evidence. Upon receiving the information that the die landed odd, intuition seems to judge that your probability that it landed 5 should be revised to 1/3, just as conditionalization would have it. Similarly for countless other judgments. Secondly, the muddy Venn diagram can now be given a dynamic interpretation: learning that E corresponds to scraping all mud off ¬E. What to do with the mud that remains? It obviously must be rescaled, since it amounts to a total of only Pinitial (E), whereas probabilities must sum to 1. Moreover, since nothing stronger than E was learned, any movements of mud within E seem gratuitous, or even downright unjustified. So our desired updating rule should preserve the profile of mud within E but renormalize it by a factor of 1/Pinitial (E); this is conditionalization. Thirdly, conditionalization is supported by a ‘diachronic’ Dutch Book argument (see [Lewis, 1999]): on the assumption that your updating is rule-governed, you are subject to a Dutch book (with bets placed at different times) if you do not conditionalize. Equally important is the converse theorem [Skyrms, 1987]: if you do conditionalize, then you are immune to such a Dutch Book. Then there are arguments for conditionalization for which there are currently no analogous arguments for (RATIO) — although I suggest that it would be fruitful to pursue such arguments. For example, Greaves and Wallace [2006] offer a “cognitive decision theory”, arguing that conditionalization is the unique updating rule that maximizes expected epistemic utility. However, there are also some sources of suspicion and even downright dissatisfaction about conditionalization. There are apparently some kinds of belief revisions that should not be so modelled. Those involving indexical beliefs are a prime example. I am currently certain that my computer’s clock reads 8:33; and yet by the time I reach the end of this sentence, I find that I am certain that it does not read 8:33. Probability mud isn’t so much scraped away as pushed sideways in such cases. Levi [1980] insists that conditionalization is also not appropriate in cases where an agent “contracts” her “corpus” of beliefs — when her stock of settled assumptions is somehow challenged, forcing her to reduce it. See [Hild, 1998;
Bacchus et al., 1990; Arntzenius, 2003] for further objections to conditionalization. Much as the considerations supporting conditionalization are similar to those supporting the ratio analysis, the considerations counter-supporting the latter counter-support the former. In particular, the objections that I raised in §4 would seem to have force equally against the adequacy of conditionalization. Recall the problem of conditions with probability zero. A point has just been chosen at random from the surface of the earth, and you learn that it lies on the equator. Conditionalization cannot model this revision in your belief state, since you previously gave probability zero to what you learned. (The same would be true of any line of latitude on which you might learn the point to be.) Similarly for your learning that Democrats won the U.S. election; similarly for your learning that a large-numbered ticket was picked in the 1,000,000-ticket lottery; similarly for your learning that I tossed the fair coin; similarly for your learning that a randomly chosen point came from the non-measurable set N. To be sure, the key idea behind conditionalization can be upheld while disavowing the ratio analysis for conditional probability. Upon receiving evidence E, one’s new probability for X should be one’s initial conditional probability for X, given E — this is neutral regarding how the conditional probability should be understood. My point is that the standard formulation of conditionalization, stated above, is not neutral: it presupposes the ratio analysis of conditional probability and inherits its problems. (Recall that P ( | ) is shorthand for a ratio of unconditional probabilities.) Popper functions allow a natural reformulation of updating by conditionalization, so that even items of evidence that were originally assigned such problematic unconditional probabilities by an agent can be learned. The result of conditionalizing a Popper function P ( , ) on a piece of evidence encapsulated by e is P ( , e) — for example, P (a, b) gets transformed to P (a, be).
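The muddy Venn diagram picture of conditionalization is easy to mechanize. Here is a minimal sketch, my illustration only, that represents a credence function as mud over finitely many worlds, scrapes the mud off ¬E, and renormalizes what remains; the die example from above serves as the test case:

```python
def conditionalize(credence, E):
    """Return the credence function updated on evidence E (a set of worlds).

    Mud outside E is scraped away; the mud inside E keeps its profile but
    is rescaled by 1 / P_initial(E), per the standard rule.
    """
    total = sum(credence[w] for w in E)
    if total == 0:
        raise ValueError("Evidence had initial probability 0: the ratio rule is silent.")
    return {w: (credence[w] / total if w in E else 0.0) for w in credence}

# The die example: uniform prior over six faces, then learn 'odd'.
prior = {face: 1 / 6 for face in range(1, 7)}
posterior = conditionalize(prior, {1, 3, 5})
print(posterior[5])  # 0.333... — the probability of 'landed 5' becomes 1/3
```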
7.2 Jeffrey conditionalization
Jeffrey conditionalization allows for less decisive learning experiences in which your probabilities across a partition {E1, E2, ...} change to {Pnew(E1), Pnew(E2), ...}, where none of these values need be 0 or 1:
Pnew(X) = Σi Pinitial(X|Ei) Pnew(Ei)
[Jeffrey, 1965; 1983; 1990]. Notice that if we replace Pinitial (X|Ei ) by Pnew (X|Ei ), we simply have an instance of the law of total probability. This theorem of the probability calculus becomes a norm of belief revision, assuming that probabilities conditional on each cell of the partition should stay ‘rigid’, unchanged throughout such an experience. Diaconis and Zabell [1982] show, by reasonable criteria for determining a metric on the space of probability functions, that this rule corresponds to updating to the nearest function in that space, subject to the constraints. One might interpret this as capturing a kind of epistemic conservatism in the spirit
of a Quinean “minimal mutilation” principle: staying as ‘close’ to your original opinions as you can, while respecting your evidence. Jeffrey conditionalization is again supported by a diachronic Dutch book argument [Armendt, 1980]. It should be noted, however, that diachronic Dutch Book arguments have found less favour than their synchronic counterparts. Levi [1991] and Maher [1992] insist that the agent who fails to conditionalize and who thereby appears to be susceptible to a Dutch Book will be able to ‘see it coming’, and thus avoid it; however, see also Skyrms’ [1993] rebuttal. Christensen [1991] denies that the alleged ‘inconsistency’ dramatized in such arguments has any normative force in the diachronic setting. van Fraassen [1989] denies that rationality requires one to follow a rule in the first place. Levi [1967] also criticizes Jeffrey conditionalization directly. For example, repeated operations of the rule may not commute, resulting in a path-dependence of one’s final epistemic state that might be found objectionable. However, Lange [2000] argues that this non-commutativity is a virtue rather than a vice.
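Jeffrey conditionalization can be sketched in the same style (again my illustration, with made-up numbers). The rigidity assumption appears directly in the code: within each cell the profile of mud is preserved, while the cell totals are reset to their new values:

```python
def jeffrey_conditionalize(credence, partition_weights):
    """Update credence (a dict over worlds) given new probabilities for the
    cells of a partition: partition_weights maps each cell (a frozenset of
    worlds) to P_new(cell). Probabilities conditional on each cell stay
    rigid. Assumes each cell has positive prior probability.
    """
    new = {}
    for cell, new_weight in partition_weights.items():
        old_weight = sum(credence[w] for w in cell)
        for w in cell:
            # P_new(w) = P_initial(w | cell) * P_new(cell)
            new[w] = (credence[w] / old_weight) * new_weight
    return new

# Hypothetical example: a glimpse by candlelight makes 'odd' 0.7 likely.
prior = {face: 1 / 6 for face in range(1, 7)}
update = {frozenset({1, 3, 5}): 0.7, frozenset({2, 4, 6}): 0.3}
posterior = jeffrey_conditionalize(prior, update)
print(round(posterior[5], 4))  # 0.7 * 1/3 ≈ 0.2333
```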
8 SOME PARADOXES AND PUZZLES INVOLVING CONDITIONAL PROBABILITY AND CONDITIONALIZATION
8.1 The Monty Hall problem
Let’s begin with a problem that is surely not a paradox, even though it is often called that. You are on the game show Let’s Make a Deal hosted by Monty Hall. Before you are three doors; behind exactly one of them is a prize, which you will win if you choose its door correctly. First, you are to nominate a door. Monty, who knows where the prize is and will not reveal it, ostentatiously opens another door, revealing it to be empty. He then gives you the opportunity to switch to the remaining door. Should you do so? Many people intuit that it doesn’t matter either way: you’re as likely to win the prize by sticking with your original door as you are by switching. That’s wrong — indeed, you are twice as likely to win by switching as by sticking with your original door. An easy way to see this is to consider the probability of failing to win by switching. The only way you could fail would be if you had initially nominated the correct door — probability 1/3 — and then, unluckily, switched away from it when given the chance. Thus, the probability of winning by switching is 2/3. The reasoning just given is surely too simple to count as paradoxical. But the problem does teach a salutary lesson regarding the importance of conditionalizing on one’s total evidence. The fallacious reasoning would have you conditionalize on the evidence that the prize is not behind the door that Monty actually opens (e.g. door 1) — that is, to assign a probability of 1/2 to each of the two remaining doors (e.g. doors 2 and 3). But your actual evidence was stronger than that: you also learned that Monty opened the door that he did. (If you initially chose the correct door, he had a genuine choice.) A relatively simple calculation shows that conditionalizing on your total evidence yields the correct answer: your updated
probability that the remaining door contains the prize is 2/3, so you should switch to it.
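That relatively simple calculation can be spelled out by brute force. In the sketch below, which is mine rather than the text’s, you have picked door 1 and Monty, randomizing when he has a genuine choice, opens door 2; conditionalizing on this total evidence gives 2/3 for door 3:

```python
from fractions import Fraction

# Joint distribution over (prize location, door Monty opens), given you picked door 1.
# Monty never opens your door or the prize door; he randomizes when free to choose.
joint = {}
for prize in (1, 2, 3):
    options = [d for d in (1, 2, 3) if d != 1 and d != prize]
    for opened in options:
        joint[(prize, opened)] = Fraction(1, 3) * Fraction(1, len(options))

# Condition on the total evidence that Monty opened door 2.
evidence = {k: v for k, v in joint.items() if k[1] == 2}
total = sum(evidence.values())
posterior_door3 = sum(v for (prize, _), v in evidence.items() if prize == 3) / total
print(posterior_door3)  # 2/3 — so you should switch
```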
8.2 Simpson’s paradox
Again, it is questionable whether an observation due to Simpson deserves to be called a “paradox”; rather, it is a fairly straightforward fact about inequalities among conditional probabilities. But the observation is undoubtedly rather counterintuitive, and it has some significant ramifications for scientific inference. The paradox was once famously instantiated by U.C. Berkeley’s admission statistics. Taken as a whole, admissions seemed to favour males, as suggested by the correlations inferred from the relative frequencies of admission of males and females:
P (admission | male) > P (admission | female).
Yet disaggregating the applications department by department, the correlations went the other way:
P (admission | male & department 1 applicant) < P (admission | female & department 1 applicant)
P (admission | male & department 2 applicant) < P (admission | female & department 2 applicant),
and so on for every department. How could this be? A simple explanation was that the females tended to apply to more competitive departments with lower admission rates. This lowered their university-wide admission rate compared to males, even though department by department their admission rate was superior. More generally, Simpson’s paradox is the phenomenon that correlations that appear at one level of partitioning may disappear or even reverse at another level of partitioning:
P (E|C) > P (E|∼C)
is consistent with
P (E|C & F1) < P (E|∼C & F1),
P (E|C & F2) < P (E|∼C & F2),
...
P (E|C & Fn) < P (E|∼C & Fn),
for some partition {F1, F2, ..., Fn}. Pearl [2000] argues that such a pattern of inequalities only seems paradoxical if we impose a causal interpretation on them. In our example, being male is presumably regarded as a (probabilistic) cause of being admitted, perhaps due to discrimination in favour of men and against women. We seem to be reasoning: “Surely unanimity in the departmental causal facts has to be preserved by the
university at large!” Pearl believes that if we rid ourselves of faulty intuitions about correlations revealing causal relations, the seeming paradoxicality will vanish. I demur. I think that we are just as liable to recoil even if the data is presented as inequalities among ratios, with no causal interpretation whatsoever. Department by department, the ratio of admitted women is greater than the ratio of admitted men, yet university-wide the inequality among the ratios goes the other way. How could this be? “Surely unanimity in the departmental ratio inequalities has to be preserved by the university at large!” Not at all, as simple arithmetic proves. We simply have faulty arithmetical intuitions.
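Since the point is pure arithmetic, it can be checked with made-up numbers (mine, not Berkeley’s). In the figures below the women’s admission rate is higher in each department, yet the aggregate rates reverse, because the women mostly applied to the harder department:

```python
# Hypothetical counts: (admitted, applied) per department and sex.
data = {
    "dept A (easy)": {"men": (80, 100), "women": (9, 10)},
    "dept B (hard)": {"men": (2, 10),  "women": (30, 100)},
}

for dept, by_sex in data.items():
    rates = {s: adm / app for s, (adm, app) in by_sex.items()}
    print(dept, rates)  # women's rate is higher in each department

# Aggregate (admitted, applied) totals per sex across departments.
totals = {s: [sum(x) for x in zip(*(by_sex[s] for by_sex in data.values()))]
          for s in ("men", "women")}
print({s: adm / app for s, (adm, app) in totals.items()})
# men: 82/110 ≈ 0.745, women: 39/110 ≈ 0.355 — the inequality reverses
```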
8.3 The Judy Benjamin problem
The general problem for probability kinematics is: given a prior probability function P , and the imposition of some constraint on the posterior probability function, what should this posterior be? This problem apparently has a unique solution for certain constraints, as we have seen — for example:
1. Assign probability 1 to some proposition E, while preserving the relative odds of all propositions that imply E. Solution: conditionalize P on E.
2. Assign probabilities p1, ..., pn to the cells of the partition {E1, ..., En}, while preserving the relative odds of all propositions within each cell. Solution: Jeffrey conditionalize P on this partition, according to the specification.
But consider the constraint:
3. Assign conditional probability p to B, given A.
The Judy Benjamin problem is that of finding a rule for transforming a prior, subject to this third constraint [van Fraassen, 1989]. van Fraassen provides arguments for three distinct such rules, and surmises that this raises the possibility that such uniqueness results “will not extend to more broadly applicable rules in general probability kinematics. In that case rationality will not dictate epistemic procedure even when we decide that it shall be rule governed” [1989, p. 343].
8.4 Non-conglomerability
Call P conglomerable in the partition X = {x1 , x2 , . . .} if k1 ≤ P (Y ) ≤ k2 whenever k1 ≤ P (Y |X = xi ) ≤ k2 for all i = 1, 2, . . . Here’s the intuitive idea. Suppose that you know now that you will learn which member of a particular partition is true. (A non-trivial partition might have as few as two members, such as {Heads} and {Tails}, or it might have countably many members.) Suppose further that you know now that whatever you learn, your probability for Q will lie in a certain interval. Then it seems that you should
now assign a probability for Q that lies in that interval. If you know that you are going to have a certain opinion in the future, why wait? — Make it your opinion now! More generally, if you know that a credence of yours will be bounded in a particular way in the future, why wait? — Bound that credence in this way now! ‘Conglomerability in a partition’ captures this desideratum. Failures of conglomerability arise when P is finitely additive, but not countably additive. As Seidenfeld et al. [1998] show, in that case there exists some countable partition in which P is not conglomerable. If updating takes place by conditionalization, failures of conglomerability lead to curious commitments reminiscent of violations of the Reflection Principle: “My future self, who is ideally rational and better informed than I am, will definitely have a credence for Q in a particular interval, but my credence for Q is not in this interval.” (See [Jaynes, 2003, Ch. 15] for a critique of Seidenfeld et al. See also Kadane et al. [1986] for a non-conglomerability result even assuming countable additivity, in uncountable partitions.)
8.5 The two-envelope paradoxes
As an example of non-conglomerability, consider the following infinite version of the so-called ‘two envelope’ paradox: Two positive integers are selected at random and turned into dollar amounts, the first placed in one envelope, the second placed in another, whose contents are concealed from you. You get to choose one envelope, and its contents are yours. Suppose that following de Finetti [1972], and in violation of countable additivity, you assign probability 0 to all finite sets of positive integers, but (of course), probability 1 to the entire set of positive integers. Let X be the amount in your envelope and Y the amount in the other envelope. Then very reasonably you assign: P (X < Y ) = 1/2. But suppose now that we let you open your envelope. You may see $1, or $2, or $3, or . . . Yet whatever you see, you will want to switch to holding the other envelope, for P (X < Y |X = x) = 1 for x = 1, 2, . . . Why wait? Since you know that you will want to switch, you should switch now. That is absurd: you surely cannot settle from the armchair that you have made the wrong choice, however you choose. A better-known version of the two-envelope paradox runs as follows. One positive integer is selected and that number of dollars is placed in an envelope. Twice as much is placed in another envelope. The contents of both envelopes are concealed from you. You get to choose one envelope, and its contents are yours. At first you think that you have no reason to prefer one envelope over another, so you choose one. But as soon as you do, you feel regret. You reason as follows: “I am holding some dollar amount — call it n. The other envelope contains either
2n or n/2, each with probability 1/2. So its expectation is (2n)(1/2) + (n/2)(1/2) = 5n/4 > n. So it is preferable to my envelope.” This is already absurd, as before. Worse, if we let you switch, your regret will immediately run the other way: “I am holding some dollar amount — call it m . . . ” And similar reasoning seems to go through even if we let you open your envelope to check its contents! Let X be the random variable ‘the amount in your envelope’, and let Y be ‘the amount in the other envelope’. Notice that a key step of the reasoning moves from
(∗) for any n, E(Y |X = n) > n = E(X|X = n)
to the conclusion that the other envelope is preferable. A missing premise is that E(Y ) > E(X). This may seem to follow straightforwardly from (∗). But that presupposes conglomerability with respect to the partition of amounts in your envelope, which is exactly what should be questioned. See [Arntzenius and McCarthy, 1997; Chalmers, 2002] for further discussion.
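Where the regret reasoning fails becomes visible once a definite, countably additive prior is in place. In the sketch below, with my own hypothetical prior on which the smaller amount is uniform on 1 to 10, E(Y|X = n) exceeds n for small n but falls below it for large n; conglomerability holds, and overall E(Y) = E(X):

```python
from fractions import Fraction

# Hypothetical prior: smaller amount k is uniform on 1..10; envelopes hold k
# and 2k, and which one you pick is a fair coin flip.
joint = {}  # (your amount X, other amount Y) -> probability
for k in range(1, 11):
    joint[(k, 2 * k)] = joint.get((k, 2 * k), Fraction(0)) + Fraction(1, 20)
    joint[(2 * k, k)] = joint.get((2 * k, k), Fraction(0)) + Fraction(1, 20)

for n in sorted({x for x, _ in joint}):
    cases = {y: pr for (x, y), pr in joint.items() if x == n}
    total = sum(cases.values())
    e_y_given_n = sum(y * pr for y, pr in cases.items()) / total
    # For small n the conditional expectation favours switching; for n > 10
    # you must be holding the larger envelope, and switching halves your money.
    print(n, e_y_given_n, "switch" if e_y_given_n > n else "stay")

e_x = sum(x * pr for (x, _), pr in joint.items())
e_y = sum(y * pr for (_, y), pr in joint.items())
print(e_x == e_y)  # True: overall, neither envelope is preferable
```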
9 PROBABILITIES OF CONDITIONALS AND CONDITIONAL PROBABILITIES
A number of authors have proposed that there are deep connections between conditional probabilities and conditionals. Ordinary English seems to allow us to shift effortlessly between the two kinds of locutions. ‘The probability of it raining, given that it is cloudy, is high’ seems to say the same thing as ‘the probability of it raining, if it is cloudy, is high’ — the former a conditional probability, the latter the probability of a conditional.
The Ramsey test and Adams’ thesis
Ramsey [1931/1990, p. 155] apparently generalized this observation in a pregnant remark in a footnote: “If two people are arguing ‘If p will q?’ and are both in doubt as to p, they are adding p hypothetically to their stock of knowledge and arguing on that basis about q; . . . We can say they are fixing their degrees of belief in q given p.” Adams [1975] more explicitly generalized the observation in his celebrated thesis that the probability of the indicative conditional ‘if A, then B’ is given by the corresponding conditional probability of B given A. He denied that such conditionals have truth conditions, so this probability is not to be thought of as the probability that ‘if A, then B’ is true. Further, Adams’ ‘probabilities’ of conditionals do not conform to the usual probability calculus — in particular, Boolean compounds involving them do not receive ‘probabilities’, as the usual closure assumptions (given in §2.1) would require. For this reason, Lewis [1976] suggests that they be called “assertabilities” instead, a practice that has been widely adopted subsequently. Note, however, that
“assertability” seems to bring in the norms of assertion. For example, Williamson [2002] argues that you should only assert what you know; but then it is hard to make sense of assertability coming in all the degrees that Adams requires of it. And conditionals can be unassertable for all sorts of reasons that seem beside the point here — they can be inappropriate, irrelevant, uninformative, undiplomatic, and so on. This is a matter of the pragmatics of conversation, which is another topic. Perhaps the locution “degrees of acceptability” better captures Adams’ idea.
Stalnaker’s Hypothesis
Stalnaker [1970], by contrast, insisted that conditionals have truth conditions, and he and Lewis were engaged in the late 60s and early 70s in a famous debate over what they were. In particular, they differed over the status of conditional excluded middle — on whether sentences of the following form are tautologies or not:
(CEM) (A → B) ∨ (A → ¬B)
Stalnaker thought so; Lewis thought not. Stalnaker upheld the equality of genuine probabilities of conditionals with the corresponding conditional probabilities, and used the attractiveness of this thesis as an argument for his preferred semantics. More precisely, the hypothesis is that some suitably quantified and qualified version of the following equation holds:
(PCCP) P (A → B) = P (B|A) for all A, B in the domain of P , with P (A) > 0.
(“→” is a conditional connective.) Stalnaker’s guiding idea was that a suitable version of the hypothesis would serve as a criterion of adequacy for a truth-conditional account of the conditional. He explored the conditions under which it would be reasonable for a rational agent, with subjective probability function P , to believe a conditional A → B. By identifying the probability of A → B with P (B|A), Stalnaker was able to put constraints on the truth conditions of the ‘→’. In particular, if this identification were sound, it would vindicate conditional excluded middle. For by the probability calculus,
P [(A → B) ∨ (A → ¬B)] = P (A → B) + P (A → ¬B) (assuming that the disjuncts are incompatible, as both authors did)
= P (B|A) + P (¬B|A) (by the identification of probabilities of conditionals with conditional probabilities)
= 1.
So all sentences of the CEM form have probability 1, as Stalnaker required. Some of the probabilities-of-conditionals literature is rather unclear on exactly what claims are under discussion: what the relevant quantifiers are, and their
domains of quantification. With the above motivations kept in mind, and for their independent interest, we now consider four salient ways of rendering precise the hypothesis that probabilities of conditionals are conditional probabilities:
Universal version: There is some → such that for all P , (PCCP) holds.
Rational Probability Function version: There is some → such that for all P that could represent a rational agent’s system of beliefs, (PCCP) holds.
Universal Tailoring version: For each P there is some → such that (PCCP) holds.
Rational Probability Function Tailoring version: For each P that could represent a rational agent’s system of beliefs, there is some → such that (PCCP) holds.
Can any of these versions be sustained? The situation is interesting however we answer this question. If the answer is ‘no’, then seemingly synonymous locutions are not in fact synonymous: surprisingly, ‘the probability of B, given A’ does not mean the same thing as ‘the probability of: B if A’. If the answer is ‘yes’, then important links between logic and probability theory will have been established, just as Stalnaker and Adams hoped. Probability theory would be a source of insight into the formal structure of conditionals. And probability theory in turn would be enriched: de Finetti [1972] laments that (RATIO) gives the formula, but not the meaning, of conditional probability, and a suitably quantified hypothesis involving (PCCP) could serve to characterize more fully what the ratio means, and what its use is.
There is now a host of results — mostly negative — concerning PCCP. We will give a sample of some of the most important ones. We will then be in a position to assess how the four versions of the hypothesis fare, and what the prospects are for other versions. Some preliminary definitions will assist in stating the results. If (PCCP) holds, we will say that → is a PCCP-conditional for P , and that P is a PCCP-function for →. If (PCCP) holds for a particular → for each member P of a class of probability functions P, we will say that → is a PCCP-conditional for P. A pair of probability functions P and P′ are orthogonal if, for some A, P (A) = 1 but P′(A) = 0. (Intuitively, orthogonal probability functions concentrate their probability on entirely non-intersecting sets of propositions.) Call a proposition A a P-atom iff P (A) > 0 and, for all X, either P (AX) = P (A) or P (AX) = 0. (Intuitively, a P-atom is a proposition that receives an indivisible ‘blob’ of probability from P .) Finally, we will call a probability function trivial if it has at most 4 different values. Most of the negative results are ‘triviality results’: given certain assumptions, only trivial probability functions can sustain PCCP. Moreover, most of them make no assumptions about the logic of the ‘→’ — it is simply a two-place connective. The earliest and most famous results are due to Lewis [1976]:
First triviality result: There is no PCCP-conditional for the class of all probability functions.
Second triviality result: There is no PCCP-conditional for any class of probability functions closed under conditionalizing, unless the class consists entirely of trivial functions.
Lewis [1986] strengthens these results:
Third triviality result: There is no PCCP-conditional for any class of probability functions closed under conditionalizing restricted to the propositions in a single finite partition, unless the class consists entirely of trivial functions.
Fourth triviality result: There is no PCCP-conditional for any class of probability functions closed under Jeffrey conditionalizing, unless the class consists entirely of trivial functions.
These results refute the Universal version of the hypothesis. They also spell bad news for the Rational Probability Function version, for even if rationality does not require updating by conditionalizing, or Jeffrey conditionalizing, it seems plausible that it at least permits such updating. This version receives its death blow from the following result by Hall [1994], which significantly strengthens Lewis’ results:
Orthogonality result: Any two non-trivial PCCP-functions defined on the same algebra of propositions are orthogonal.
It follows from this that the Rational Probability Function version is true only if any two distinct rational agents’ probability functions are orthogonal — which is absurd. So far, the ‘tailoring’ versions remain unscathed. The Universal Tailoring version is refuted by the following result due to Hájek [1989; 1993], which concerns probability functions that assume only a finite number of distinct values:
Finite-ranged Functions Result: Any non-trivial probability function with finite range has no PCCP-conditional.
This result also casts serious doubt on the Rational Probability Function Tailoring version, for it is hard to see why rationality requires one to adopt a probability function with infinite range. The key idea behind this result can be understood by considering a very simple case. Consider a three-ticket lottery, and let Li = ‘ticket i wins’, i = 1, 2, 3. Let P assign probability 1/3 to each of the Li. Clearly, some conditional probabilities take the value 1/2 — for example, P (L1|L1 ∨ L2). But no unconditional probability can take this value, being constrained to be a multiple of 1/3; a fortiori, no (unconditional) probability of a conditional can take this value. The point generalizes to all finite-ranged probability functions: there will always be some value of the conditional probability function that finds no match among the unconditional probabilities, and a fortiori no match among the (unconditional) probabilities of conditionals. Picture a dance at which, for a given finite-ranged probability function, all of the probability-of-a-conditional values line up along one wall, and all of the conditional probability values line up along the opposite wall.
Now picture each conditional probability value attempting to partner up with a probability of a conditional with the same value on the other side. According to Stalnaker’s hypothesis, the dance would always be a complete success, with all the values finding their matches; the Finite-ranged Functions Result shows that the dance can never be a complete success. There will always be at least one wallflower among the conditional probabilities, which will have to sit out the dance — for example, 1/2 in our lottery case. If we make a minimal assumption about the logic of the →, matters are still worse thanks to another result of Hall’s [1994]:
No Atoms Result: Let the probability space ⟨Ω, F, P⟩ be given, and suppose that PCCP holds for this P , and a ‘→’ that obeys modus ponens. Then ⟨Ω, F, P⟩ does not contain a P-atom, unless P is trivial.
It follows from this, on pain of triviality, that the range of P , and hence Ω and F, are non-denumerable. All the more, it is hard to see how rationality requires this of an agent’s probability space. It seems, then, that all four versions of the hypothesis so far considered are untenable. (See also [Hájek, 1994] for more negative results.) For all that has been said so far, though, some suitably restricted ‘tailoring’ version might still survive. A natural question, then, is whether even Hall’s ‘no atoms’ result can be extended — whether even uncountable probability spaces cannot support PCCP, thus effectively refuting any ‘tailoring’ version of the hypothesis. The answer is ‘no’ — and here we have a positive result due to van Fraassen [1976]. Suppose that → has this much logical structure:
(i) [(A → B) ∩ (A → C)] = [A → (B ∩ C)]
(ii) [(A → B) ∪ (A → C)] = [A → (B ∪ C)]
(iii) [A ∩ (A → B)] = (A ∩ B)
(iv) [A → A] = Ω.
Such an → conforms to the logic CE. van Fraassen shows:
CE tenability result: Any probability space can be extended to one for which PCCP holds, with an → that conforms to CE.
Of course, the larger space for which PCCP holds is uncountable. In the same paper, van Fraassen also shows that → can have still more logical structure, while supporting PCCP, provided we restrict the admissible iterations of → appropriately. A similar strategy of restriction protects Adams’ version of the hypothesis from the negative results. He applies PCCP to unembedded conditionals — ‘simple’ conditionals of the form A → B, where A and B are themselves conditional-free. As mentioned before, Adams does not allow the assignment of probabilities to Boolean compounds involving conditionals; ‘P ’ is thus not strictly speaking
a probability function (and thus the negative results, which presuppose that it is, do not apply). McGee [1989] shows how Adams’ theory can be extended to certain more complicated compounds of conditionals, while still falling short of full closure.
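Returning to the three-ticket lottery, the wallflower can be exhibited by brute-force enumeration (a sketch of mine, not from the text): every unconditional probability of the space is a multiple of 1/3, so no proposition, and a fortiori no conditional proposition, can have probability 1/2, while the conditional probability P(L1|L1 ∨ L2) is exactly 1/2:

```python
from fractions import Fraction
from itertools import combinations

tickets = (1, 2, 3)
events = [frozenset(c) for r in range(4) for c in combinations(tickets, r)]

def P(a):
    return Fraction(len(a), 3)  # uniform lottery: each ticket wins with probability 1/3

unconditional_values = {P(a) for a in events}
print(sorted(unconditional_values))  # [0, 1/3, 2/3, 1]

L1, L1_or_L2 = frozenset({1}), frozenset({1, 2})
ratio = P(L1 & L1_or_L2) / P(L1_or_L2)
print(ratio, ratio in unconditional_values)  # 1/2 False — the wallflower
```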
10 CONCLUSION
This survey has been long, and yet still I fear that some readers will be disappointed that I have not discussed adequately, or at all, their favourite application of or philosophical issue about conditional probability. They may find some solace in the lengthy bibliography that follows. Along the way we have seen some reasons for questioning the orthodoxy enshrined in Kolmogorov’s ratio analysis; moreover, his more sophisticated formulation seems not entirely successful either. I have argued that we should take conditional probability as the primitive notion in probability theory, although this still remains a minority position. However we resolve this issue, we have something of a mathematical and philosophical balancing act: finding an entirely satisfactory mathematical and philosophical theory of conditional probability that as much as possible does justice to our intuitions and to its various applications. It is an act worth getting right: the foundations of probability theory depend on it, and thus any theory that employs probability theory depends on it also — which is to say, any serious empirical discipline, and much of philosophy.4
4 I am grateful to Elle Benjamin, Darren Bradley, Lina Eriksson, Marcus Hutter, Aidan Lyon, John Matthewson, Ralph Miles, Nico Silins, Michael Smithson, Weng Hong Tang, Peter Vanderschraaf, Wen Xuefeng, and especially John Cusbert, Kenny Easwaran and Michael Titelbaum for helpful suggestions.
BIBLIOGRAPHY
[Adams, 1975] E. Adams. The Logic of Conditionals, Reidel, 1975.
[Armendt, 1980] B. Armendt. Is There a Dutch Book Argument for Probability Kinematics? Philosophy of Science 47, No. 4 (December), 583-88, 1980.
[Arntzenius, 2003] F. Arntzenius. Some Problems for Conditionalization and Reflection, Journal of Philosophy Vol. C, No. 7, 356-371, 2003.
[Arntzenius and McCarthy, 1997] F. Arntzenius and D. McCarthy. The Two Envelope Paradox and Infinite Expectations, Analysis 57, No. 1, 28-34, 1997.
[Bacchus et al., 1990] F. Bacchus, H. E. Kyburg Jr., and M. Thalos. Against Conditionalization, Synthese 85, 475-506, 1990.
[Billingsley, 1995] P. Billingsley. Probability and Measure, John Wiley and Sons, Third Edition, 1995.
[Blackwell and Dubins, 1975] D. Blackwell and L. Dubins. On Existence and Non-existence of Proper, Regular, Conditional Distributions, The Annals of Probability 3, No. 5, 741-752, 1975.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability, Chicago: University of Chicago Press, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods, Chicago: The University of Chicago Press, 1952.
[Cartwright, 1979] N. Cartwright. Causal Laws and Effective Strategies, Noûs 13, 419-437, 1979.
[Chalmers, 2002] D. Chalmers. The St. Petersburg Two-Envelope Paradox, Analysis 62, 155-57, 2002.
[Christensen, 1991] D. Christensen. Clever Bookies and Coherent Beliefs, The Philosophical Review C, No. 2, 229-247, 1991.
[de Finetti, 1937] B. de Finetti. La Prévision: Ses Lois Logiques, Ses Sources Subjectives, Annales de l’Institut Henri Poincaré, 7: 1-68, 1937. Translated as Foresight. Its Logical Laws, Its Subjective Sources, in Studies in Subjective Probability, H. E. Kyburg, Jr. and H. E. Smokler (eds.), Robert E. Krieger Publishing Company, 1980.
[de Finetti, 1972] B. de Finetti. Probability, Induction and Statistics, Wiley, 1972.
[de Finetti, 1974/1990] B. de Finetti. Theory of Probability, Vol. 1. Chichester: Wiley Classics Library, John Wiley & Sons, 1974/1990.
[Diaconis and Zabell, 1982] P. Diaconis and S. L. Zabell. Updating Subjective Probability, Journal of the American Statistical Association 77, 822-30, 1982.
[Dowe et al., 2007] D. Dowe, S. Gardner, and G. Oppy. Bayes not Bust! Why Simplicity is No Problem for Bayesians, British Journal for the Philosophy of Science 58, 4, 709-54, 2007.
[Eells, 1991] E. Eells. Probabilistic Causality, Cambridge University Press, Cambridge, 1991.
[Eells and Skyrms, 1994] E. Eells and B. Skyrms, eds. Probability and Conditionals, Cambridge University Press, 1994.
[Forster and Sober, 1994] M. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions, British Journal for the Philosophy of Science 45, 1-35, 1994.
[Gaifman, 1988] H. Gaifman. A Theory of Higher Order Probabilities, in Causation, Chance, and Credence, Brian Skyrms and William L. Harper, eds., Dordrecht: Kluwer Academic Publishers, 1988.
[Giere, 1973] R. Giere. Objective Single-Case Probabilities and the Foundations of Statistics, in Logic, Methodology and Philosophy of Science IV — Proceedings of the Fourth International Congress for Logic, P. Suppes, L. Henkin, A. Joja, and G. Moisil (eds.), New York: North-Holland, 1973.
[Goldstein, 1983] M. Goldstein. The Prevision of a Prevision, Journal of the American Statistical Association 78, 817-819, 1983.
[Greaves and Wallace, 2006] H. Greaves and D. Wallace. Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility, Mind 115 (459), 607-632, 2006.
[Grunwald et al., 2005] P. Grunwald, M. A. Pitt, and I. J. Myung, eds. Advances in Minimum Description Length: Theory and Applications, Cambridge, MA: MIT Press, 2005.
[Hacking, 1965] I. Hacking. The Logic of Statistical Inference, Cambridge: Cambridge University Press, 1965.
[Hájek, 1989] A. Hájek. Probabilities of Conditionals — Revisited, Journal of Philosophical Logic 18, No. 4, 423-428, 1989.
[Hájek, 1993] A. Hájek. The Conditional Construal of Conditional Probability, Ph.D. Dissertation, Princeton University. Available at http://fitelson.org/conditionals/hajek_dissertation.pdf
[Hájek, 1994] A. Hájek. Triviality on the Cheap?, in Eells and Skyrms, 113-140, 1994.
[Hájek, 2003] A. Hájek. What Conditional Probability Could Not Be, Synthese 137, No. 3 (December), 273-323, 2003.
[Hall, 1994] N. Hall. Back in the CCCP, in Eells and Skyrms, 141-60, 1994.
[Hausman and Woodward, 1999] D. Hausman and J. F. Woodward. Independence, Invariance and the Causal Markov Condition, The British Journal for the Philosophy of Science 50, No. 4, 521-583, 1999.
[Hempel, 1965] C. Hempel. Aspects of Scientific Explanation and Other Essays in the Philosophy of Science, New York: Free Press, 1965.
[Hild, 1998] M. Hild. The Coherence Argument Against Conditionalization, Synthese 115, 229-258, 1998.
[Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science, Cambridge: Cambridge University Press, 2003.
[Jeffrey, 1965/1983/1990] R. C. Jeffrey. The Logic of Decision, Chicago: Chicago University Press, 1965/1983/1990.
[Jeffreys, 1961] H. Jeffreys. Theory of Probability, Oxford: Oxford University Press (originally published in 1939, and now in the Oxford Classic Texts in the Physical Sciences series), 1961.
[Johnson, 1921] W. E. Johnson. Logic, Cambridge: Cambridge University Press, 1921.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability, London: Macmillan, 1921.
[Kadane et al., 1986] J. B. Kadane, M. J. Schervish and T. Seidenfeld. Statistical Implications of Finitely Additive Probability, in Bayesian Inference and Decision Techniques, Prem K. Goel and Arnold Zellner (eds.), Amsterdam: Elsevier Science Publishers, 59-76, 1986.
[Kolmogorov, 1933/1950] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergebnisse der Mathematik, 1933. Translated as Foundations of Probability, New York: Chelsea Publishing Company, 1950.
[Lange, 2000] M. Lange. Is Jeffrey Conditionalization Defective by Virtue of Being Non-Commutative? Remarks on the Sameness of Sensory Experiences, Synthese 123, 393-403, 2000.
[Levi, 1967] I. Levi. Probability Kinematics, British Journal for the Philosophy of Science 18, 197-209, 1967.
[Levi, 1974] I. Levi. On Indeterminate Probabilities, Journal of Philosophy 71, 391-418, 1974.
[Levi, 1980] I. Levi. The Enterprise of Knowledge, Cambridge: MIT Press, 1980.
[Levi, 1991] I. Levi. Consequentialism and Sequential Choice, in Michael Bacharach and Susan Hurley (eds.), Foundations of Decision Theory, Oxford: Basil Blackwell, 92-146, 1991.
[Levi, 2000] I. Levi. Imprecise and Indeterminate Probabilities, Risk, Decision and Policy 5, 111-122, 2000.
[Lewis, 1973] D. Lewis. Counterfactuals, Blackwell and Harvard University Press, 1973.
[Lewis, 1976] D. Lewis. Probabilities of Conditionals and Conditional Probabilities, Philosophical Review 85, 297-315, 1976.
[Lewis, 1980] D. Lewis. A Subjectivist’s Guide to Objective Chance, in Studies in Inductive Logic and Probability, Vol. II, University of California Press, 263-293, 1980; reprinted in Philosophical Papers Volume II, Oxford: Oxford University Press, 1986.
[Lewis, 1986] D. Lewis. Probabilities of Conditionals and Conditional Probabilities II, Philosophical Review 95, 581-589, 1986.
[Lewis, 1999] D. Lewis. Papers in Metaphysics and Epistemology, Cambridge: Cambridge University Press, 1999.
[Maher, 1992] P. Maher. Diachronic Rationality, Philosophy of Science 59, 120-141, 1992.
[Maher, 2007] P. Maher. Explication Defended, Studia Logica 86, 331-341, 2007.
[McGee, 1989] V. McGee. Conditional Probabilities and Compounds of Conditionals, The Philosophical Review 98, 485-541, 1989.
[McGee, 1994] V. McGee. Learning the Impossible, in Eells and Skyrms, 179-99, 1994.
[Pearl, 2000] J. Pearl. Causality: Models, Reasoning, and Inference, Cambridge: Cambridge University Press, 2000.
[Popper, 1959a] K. Popper. The Propensity Interpretation of Probability, British Journal for the Philosophy of Science 10, 25-42, 1959.
[Popper, 1959b] K. Popper. The Logic of Scientific Discovery, Basic Books, 1959.
[Ramsey, 1931/1990] F. P. Ramsey. General Propositions and Causality, in R. B. Braithwaite, ed., The Foundations of Mathematics and Other Logical Essays, Routledge, 1931; also in Philosophical Papers, ed. D. H. Mellor, Cambridge: Cambridge University Press, 1990.
[Reichenbach, 1956] H. Reichenbach. The Direction of Time, Berkeley: University of California Press, 1956.
[Rényi, 1970] A. Rényi. Foundations of Probability, Holden-Day, Inc., 1970.
[Roeper and Leblanc, 1999] P. Roeper and H. Leblanc. Probability Theory and Probability Logic, Toronto: University of Toronto Press, 1999.
[Salmon, 1980] W. Salmon. Probabilistic Causality, Pacific Philosophical Quarterly 61, 50-74, 1980.
[Schwarz, 1978] G. Schwarz. Estimating the Dimension of a Model, Annals of Statistics 6 (2), 461-464, 1978.
[Seidenfeld et al., 1998] T. Seidenfeld, M. J. Schervish, and J. B. Kadane. Non-Conglomerability for Finite-Valued, Finitely Additive Probability, Sankhya Series A, Vol. 60, No. 3, 476-491, 1998.
[Seidenfeld et al., 2001] T. Seidenfeld, M. J. Schervish, and J. B. Kadane. Improper Regular Conditional Distributions, The Annals of Probability 29, No. 4, 1612-1624, 2001.
[Shimony, 1955] A. Shimony. Coherence and the Axioms of Confirmation, Journal of Symbolic Logic 20, 1-28, 1955.
[Skyrms, 1980] B. Skyrms. Causal Necessity, Yale University Press, 1980.
[Skyrms, 1987] B. Skyrms. Dynamic Coherence and Probability Kinematics, Philosophy of Science 54, No. 1 (March), 1-20, 1987. [Skyrms, 1993] B. Skyrms. A Mistake in Dynamic Coherence Arguments? Philosophy of Science 60, 320-328, 1993. [Spirtes et al., 2000] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, 2nd ed. New York, N.Y.: MIT Press, 2000. [Stalnaker, 1968] R. Stalnaker. A Theory of Conditionals, Studies in Logical Theory, American Philosophical Quarterly Monograph Series, No. 2, Oxford: Blackwell, 1968. [Stalnaker, 1970] R. Stalnaker. Probability and Conditionals, Philosophy of Science 37, 64–80, 1970. [Suppes, 1970] P. Suppes. A Probabilistic Theory of Causality, Amsterdam: North Holland Publishing Company, 1970. [van Fraassen, 1976] B. van Fraassen. Probabilities of Conditionals, in Harper and Hooker (eds.), Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, Vol. I, Reidel, 261-301, 1976. [van Fraassen, 1984] B. van Fraassen. Belief and the Will, Journal of Philosophy 81, 235-256, 1984. [van Fraassen, 1989] B. van Fraassen. Laws and Symmetry, Oxford: Clarendon Press, 1989. [van Fraassen, 1990] B. van Fraassen. Figures in a Probability Landscape, in J. M. Dunn and A. Gupta (eds.), Truth or Consequences, Kluwer, 1990. [van Fraassen, 1995] B. van Fraassen. Belief and the Problem of Ulysses and the Sirens, Philosophical Studies 77, 7-37, 1995. [Wallace and Dowe, 1999] C. S. Wallace and D. L. Dowe. Minimum Message Length and Kolmogorov Complexity, The Computer Journal 42, No. 4, (special issue on Kolmogorov complexity), 270-283, 1999. [Walley, 1991] P. Walley. Statistical Reasoning with Imprecise Probabilities, London: Chapman Hall, 1991. [Williamson, 2002] T. Williamson. Knowledge and Its Limits, Oxford: Oxford University Press, 2002.
THE VARIETIES OF CONDITIONAL PROBABILITY
Kenny Easwaran
Alan Hájek discusses many interesting features and applications of the notion of conditional probability. In addition to this discussion, he also gives arguments for some specific philosophical views on the nature of conditional probability. In general I agree with most of his points, and in particular with his arguments that Kolmogorov’s “ratio analysis” of conditional probability is not correct. However, he suggests three more points that I would like to contest — that there is a single correct analysis of conditional probability across interpretations, that conditional probability is in general a more fundamental notion than unconditional probability, and that Popper’s account of conditional probability is the correct one. I discuss all of these issues (and especially the third) in my dissertation [Easwaran, 2008], but here I will focus more on the first two. I will argue in section 1 that although the different interpretations of probability have many similarities, this similarity is not exact, even at the purely formal level. In particular, I will argue that the bearers of probability in the propensity, subjective, and logical interpretations of probability are distinct classes of objects, and that this changes the mathematical relations that must hold between conditional and unconditional probabilities. As for Hájek’s argument that conditional probabilities are fundamental, in section 2 I will demonstrate a distinction that I think his argument collapses.
1 PLURALISM ABOUT CONDITIONAL PROBABILITY
The fact that a single word, “probability”, is applied to many diverse phenomena, and the existence of a branch of mathematics called “probability theory”, suggest that these diverse phenomena can all be understood as applications of the same formal system. However, there are many different formal systems that all claim to describe probability — for instance, both [Kolmogorov, 1950] and [Rényi, 1970] take the objects of the probability function to be sets, although only the latter allows for conditioning on events of probability zero; the system described in [Hailperin, 1984; Hailperin, 1997; Hailperin, 2000; Hailperin, 2006] and that of [Popper, 1959] both allow for the objects not to be sets, but only the latter takes conditional probability as fundamental. Because of this variety of available systems, it seems that an argument is needed to show that different interpretations of probability should use the same one.1
1 There are some terminological issues here — some suggest that what it means to be an interpretation of probability requires satisfying some specific set of mathematical axioms [Eells, 1983]. I just take the term “interpretation of probability” to apply to the notions that have historically been considered paradigmatic cases. Since my arguments suggest that there is no unified formalism that applies to all of them, this will mean that most of them won’t really count as interpretations of probability in this other sense.
At first it might seem that principles like the “Principal Principle” [Lewis, 1980] give the required link to show that the different interpretations should use the same formalism. However, such arguments are not definitive. A standard, simplified version of the Principal Principle states that CR(A|CH(A) = x) = x, where CR is an agent’s credence function, and CH is the objective chance function. As stated, this principle requires that the objects of the two functions are the same sort of thing (since A is an object of both). It also gives conditions (when the agent knows all the chances) under which the two functions are apparently identical, so there should be no general axioms for subjective probability that rule out any function that satisfies the axioms for objective chance. However, phrasing the principle more carefully, we will be able to see that these two points are not actually correct. First, Paul Humphreys has argued that the objects of the chance function are events, rather than sentences or sets of worlds [Humphreys, 2004, p. 669]. Even if this is right, a version of the Principal Principle can be stated that is compatible with the arguments of the credence function being sets, as Kolmogorov suggested. Consider the set of all epistemic possibilities for an agent.2 Let [A] be the set of epistemic possibilities according to which the physical event A occurs. Then the Principal Principle can be reformulated by saying that CR([A]|CH(A) = x) = x, so that the different functions can still be defined over different objects. The principle only requires that there be some way of associating some objects of the chance function and some objects of the credence function, not that the objects actually be the same things. Further, although the Principal Principle may entail that there be no constraints on unconditional credences that don’t also apply to chances, it doesn’t directly relate conditional chances to credences, so there is still room for the conditional notions to satisfy different mathematical constraints.3 Although in many cases the values of conditional probabilities can be derived from the values of unconditional probabilities, this is not true when unconditional probabilities take the value 0, which is a case that Hájek and I agree is often interesting. And as I will suggest later, conditional chances may not be directly constrained by the unconditional ones even in cases where they are all non-zero. Similar moves can be made for other principles that link different interpretations of probability together — these principles don’t guarantee that there is a single unified mathematical theory that all interpretations must follow.
2 These may be possible worlds, impossible worlds, maximal consistent sets of sentences, or some other sort of entities.
3 [Lewis, 1994] gives a “New Principal Principle” that suggests equating conditional probabilities in some circumstances as well. But on both principles it is only under special conditions that the functions must agree. This suggests that the functions may disagree and have different formalisms in other conditions, though I will not investigate this possibility here.
Thus, the primary objection to the possibility of different formal accounts of different interpretations of probability is incorrect — the principles connecting different interpretations of probability have natural modifications that allow the objects of the functions to be different, and allow different relations between conditional and unconditional probability to hold in each. In the rest of this section I will consider several different interpretations of probability and argue for each that it differs from the standard mathematical account of probability given in the first few chapters of [Kolmogorov, 1950], on which probability is a countably-additive, non-negative, normalized function on an algebra of sets, and the conditional probability P (A|B) is given by P (A ∩ B)/P (B), whenever P (B) has a precise non-zero value. Since they differ from this account in different ways, there can be no unified account of all of them that is complete for any one.
1.1 Propensity
The first interpretation of probability I will consider is Hájek’s “propensity interpretation”. (I will use the terms “objective chance” and “propensity” interchangeably.) I will argue that the standard account of conditional probability is incorrect here. Hájek gives some brief arguments in Section 2.2 that various interpretations of probability must satisfy P (A|B) = P (A ∩ B)/P (B), and then raises problems in Section 4 in which P (B) is zero, imprecise, vague, or undefined. However, none of the arguments he presents in favor of the ratio account applies specifically to the propensity interpretation. I will suggest that there may be reason to think that there are further failures of the ratio account for the propensity interpretation, even when P (B) is well-defined and non-zero. The beginnings of this criticism come from [Humphreys, 1985], which argues that conditional chances don’t obey Bayes’ Theorem. This claim is known as “Humphreys’ Paradox.” The idea is that if B comes causally before A, then P (A|B) makes sense in a standard way, while P (B|A) should either be 0, P (B), undefined, or otherwise different from the value P (A|B)P (B)/P (A). Several objections to Humphreys’ argument are raised in [McCurdy, 1996] and [Gillies, 2000], but there are responses in [Humphreys, 2004]. Regardless of the status of that argument, however, I think it brings up an interesting point about the interpretation of conditional probability, when probability is interpreted as objective chance, or propensity. Under most accounts of this interpretation, this means that the probability function measures some sort of disposition. If this is right, then we might think of the distinction between conditional and unconditional chances as being parallel to a distinction between dispositions that are triggered by some external conditions, and those that have some tendency to manifest regardless of the conditions. However, some responses to Humphreys’ Paradox rely on interpreting the conditional probability P (A|B) differently, as a measure of the system’s unconditional disposition to produce both A and B simultaneously, out of the cases where it happens to produce B. This interpretation (which Humphreys calls the “co-production” interpretation) effectively
effectively ends up defining conditional chance by the ratio formula, rather than taking conditional chance as a notion in need of its own analysis.

For a potential counterexample to the ratio formula, where P(B) is well-defined and positive but P(A|B) ≠ P(A ∧ B)/P(B), consider a fair coin-flipping machine whose trigger is inside a locked room. The easiest way to open the door is to turn on a huge magnet, which unlocks the door but also prevents the coin from rotating in the air, so that while the magnet is activated, the coin almost always lands heads. Then on the co-production account, the probability of the coin coming up heads given that it is flipped is close to 1, because the propensity of the situation to give rise to a coin landing heads is almost as great as the propensity of the situation to give rise to a coin being flipped at all. However, while the magnet is off and the room is locked, it seems that we may want to say that the propensity for the device to produce a coin landing heads, given that it produces a flip at all, is close to 1/2 — this is exactly what we mean when we say that it is a fair coin-flipping machine. This example suggests that the co-production account of conditional propensities gives rise to a notion that is not actually dispositional — it parallels the examples of finks that are standard in the literature on dispositions [Fara, 2006]. I suspect that similar examples can be devised to correspond to other problematic cases in the literature on dispositions, like so-called “masks” and “mimics”. To get the behavior I have suggested is intuitive here, we may need to observe the distinction between intervening and conditioning given in [Meek and Glymour, 1994].

It might be objected to this apparent counterexample that often this sort of non-dispositional behavior is exactly what we want from conditional propensities. For example, [Gillies, 2000, p. 828] includes an example of a barometer, in which it seems that we want a notion of conditional propensity on which the probability of a storm, given that the barometer level drops, is fairly high, even though there is no causal influence of the barometer on the storm. The co-production account gives a high conditional probability in this situation (since most cases in which the barometer level drops are cases in which there is a storm), but the more directly dispositional account Humphreys and I suggest gives a fairly low value (since lowering the barometer level has no tendency to cause storms). I agree that we may want such a notion of conditional probability, but it’s not clear that a notion of conditional propensity is needed here. I think all the work of this conditional probability can be done with a notion like conditional degree of belief, even though an important part of the work will be done by the notion of unconditional propensity (or objective chance).

To see how this works, we can observe how the desired conditional probability comes out of the unconditional chances by means of Lewis’ Principal Principle. A simplified version of this principle states that if an agent knows at some time t the chance at that time that some event A will occur, then her degree of belief in that event should equal the chance of that event. In a toy example, we might consider an agent who knows that the chance of a storm is .25, the chance of a drop in the barometer is .25, and the chance of both occurring is .2. Because she knows these chances, these will also be her credences. Because conditional credences are generally related to unconditional credences by the ratio formula P(A|B) = P(A ∧ B)/P(B), she will have credence .8 in a storm conditional on a drop in the barometer. The conditional chances never enter into her credence function here, so her reasoning can work just as Gillies wants, without having to accept his argument as saying anything about conditional chances. Now, most actual agents don’t know the precise chances of any of these events, but they do have some ideas about them. All that is needed to make a modified version of this account work is that the agent be fairly confident that the ratio of the chance of a storm together with a drop in the barometer to the chance of a drop in the barometer is higher than the chance of a storm.
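The toy calculation can be made fully explicit. The following minimal sketch in Python (with the numbers from the example above; the variable names are invented for illustration) shows that only unconditional chances are imported into the credence function via the Principal Principle, with the conditional credence then fixed by the ratio formula for credences.

```python
# Unconditional chances the agent is assumed to know (numbers from the toy
# example above).
chance_storm, chance_drop, chance_both = 0.25, 0.25, 0.20

# Principal Principle (simplified): known unconditional chances become the
# agent's unconditional credences.
cr_storm, cr_drop, cr_both = chance_storm, chance_drop, chance_both

# The conditional credence is then fixed by the ratio formula for credences;
# no conditional chance is consulted anywhere in the computation.
cr_storm_given_drop = cr_both / cr_drop
print(cr_storm_given_drop)  # 0.8
```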
Given that an account of conditional chance doesn’t need to accommodate these evidential correlations, it seems that we can get a more useful and interesting theory by avoiding the co-production interpretation — just as chances are dispositions on the propensity account, conditional chances are conditional dispositions. As Humphreys puts it in his reply to Gillies and McCurdy, “the conditional propensity constitutes an objective relationship between two events and any increase in our information about one when we learn of the other is a completely separate matter.” [Humphreys, 1985, p. 563] But this ends up leading to the conclusion I mentioned above — conditional propensity doesn’t obey the mathematical relationships traditionally used for conditional probability.

To make the minimal modification to the standard framework, one natural suggestion (embraced by Fetzer and others) is that in cases like the ones Humphreys describes, conditional propensities are defined in one direction but undefined in the other. That is, P(A|B) makes sense when A is the event of a coin flip coming up heads and B is the event of it being flipped fairly, but not when the two are reversed. (This is a difference from Humphreys’ picture, on which the inverse conditional propensity has a value of 0, 1, or the value of a related unconditional probability, rather than being undefined.) Then we just require that conditional propensities obey the standard probability axioms whenever they are defined. However, the example of the coin-flipping device in a magnetized room seems to suggest otherwise: in that case, the only way to get the intuitively correct conditional probabilities involves violating the ratio account of conditional probability, even though there is no division by zero or anything else mathematically strange going on.

But I think that making this much of a departure from the standard Kolmogorov axioms for probability makes sense in this case, at least in part because of the metaphysics of the objects of the probability functions. Humphreys suggests that “the arguments of the propensity functions are names designating specific physical events. They do not pick out subsets of an outcome space as in the measure-theoretic approach.” [Humphreys, 2004, p. 669] This, if correct, would be a further departure from the standard mathematical theory. Since much of the motivation for the ratio formula comes from taking the objects of the probability function to be sets or sentences, this helps further undermine the argument that chance must satisfy this formula.
Other considerations on conditional chance come up when considering its uses. Humphreys’ paradox (which motivates moving away from the standard mathematical formalism) proceeds from an intuition about what conditional chance means. Hájek mentions that evidential decision theory uses conditional probability — it may be that evidential decision theory makes use of conditional credences, while causal decision theory [Joyce, 1999] makes use of the agent’s expectation of the conditional chances. If so, then the differences between the decision theories motivate a difference between the notions of conditional probability.

In [Lewis, 1980], Lewis suggests another use for conditional chance, which is to describe how unconditional chances change over time. The idea is that chances are explicitly indexed to times. Thus, at two different times t1 and t2, there are generally two distinct probability functions P1 and P2 giving the chances for various events. Lewis argues that P2(A) = P1(A|B), where B is “the complete history of the interval between” the two times [Lewis, 1980, p. 280]. In [Lange, 2006] there are some arguments against this claim — however, these depend on whether objective chances can change without any particular non-dispositional fact being the basis of such a change. At any rate, the modified claim is that instead of B being the history of all events occurring in the interval, B is specified to be the conjunction of the outcomes of all chance events in the interval. But in either case, only very specific conditional probabilities are needed — these are always probabilities of a later event conditional on an earlier one (so none of Humphreys’ inverse conditional probabilities need to be used), and they are in fact always probabilities conditional on complete sets of occurrences between two times. It is plausible that such cases are never like the case I described with the coin-flipping device in a magnetized room, so the question of whether those sorts of cases also violate the ratio account of conditional probability doesn’t come up. Thus, this use of conditional chance doesn’t seem to cut one way or the other in the dispute about whether conditional chances should behave like other conditional probabilities. If anything, this use suggests that Humphreys may be right about the order of the events being relevant to whether the conditional chance is even defined.

In conclusion, I have argued that chances are properties of events, rather than propositions, sets of worlds, or sentences. Because of the connection between chances and dispositions, and the role of chance in causal decision theory, it seems that the ratio analysis of conditional probability must fail, even in cases where the probabilities are non-zero. I don’t have a substantive positive theory to offer in its place, but this is a question for future research.
1.2 Subjective Probability
The second interpretation of probability is the “subjective” or “Bayesian” interpretation. The probability function is said to give an agent’s “degrees of belief” or
“credences”.4 In this case, I think the standard account is correct as far as it goes — the difference only arises in the case where P(B) doesn’t have a precise non-zero value. In these cases, I argue that conditional probability is best analyzed by the account described by Hájek in Section 5 as “Kolmogorov’s refinement”. The primary reason is that such an account is entailed by the principle of “conglomerability” (which Hájek mentions in Section 8.4), which states that there cannot be a partition 𝒢 (a set of propositions such that the agent is certain that exactly one proposition in the set is true) such that for every G ∈ 𝒢, P(A|G) > P(A). Basically, this means that no experiment can be such that every single conceivable outcome of the experiment would confirm A. (Since most interpretations of probability lack a notion corresponding to confirmation, this motivation for the principle only extends to the subjective interpretation.) A slightly weaker principle that also pushes in this direction is an identification of two notions of independence — conglomerability entails that if A is independent of a partition 𝒢 in the sense that P(A|G) is constant for all G ∈ 𝒢, then A is independent of 𝒢 in the sense that P(A|G) = P(A) for all G ∈ 𝒢.
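In a finite setting, where the ratio formula applies throughout, conglomerability is guaranteed. The following minimal Python sketch (invented outcomes and events, purely illustrative) checks the constraint just stated: P(A) always lies between the smallest and largest value of P(A|G) across a partition, so no partition can have P(A|G) > P(A) for every cell.

```python
from fractions import Fraction

# Six equally weighted outcomes, split into a three-cell partition.
weights = {w: Fraction(1, 6) for w in ("a1", "a2", "b1", "b2", "c1", "c2")}

def P(event):
    return sum(weights[w] for w in event)

def P_cond(A, B):
    return P(A & B) / P(B)  # ratio formula; every cell below has positive probability

A = {"a1", "b1", "c1", "c2"}
partition = [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}]

conditionals = [P_cond(A, G) for G in partition]
# Conglomerability: P(A) cannot exceed every P(A|G), nor fall below all of them.
assert min(conditionals) <= P(A) <= max(conditionals)
print(P(A), *conditionals)  # 2/3 1/2 1/2 1
```

Failures of conglomerability can only arise with infinite partitions, typically ones containing probability-zero cells, which is precisely where the choice of formalism for conditional probability starts to matter.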
I also suggest that the examples of “impropriety” that Hájek attributes in Section 5.2 to Seidenfeld, Schervish, and Kadane are irrelevant for probability as degree of belief. The particular probability spaces and partitions they use are so complicated that they can only be grasped by minds that are far more complicated than the ones to which we normally attribute subjective probabilities. (In particular, to actually construct a specific example of such a space, an agent would need to independently consider uncountably many distinct propositions.) However, other interpretations of probability don’t obviously require that every partition involved be graspable by a finite mind. For those interpretations, conglomerability may well fail, and thus a different mathematical account of conditional probability will be relevant.5

A further apparent problem for Kolmogorov’s refinement is that a single event G may be an element of two different partitions 𝒢 and 𝒢′, and P(A|G) may be constrained to have two distinct values by these different partitions. ([Kadane et al., 1986] shows that there are certain types of probability space in which this is unavoidable.) However, I suggest that this means we must view conditional probability as (in general) a three-place function, depending not only on A and G, but also on the partition 𝒢 defining the set of “relevant alternatives” to G. In particular cases, this partition will be specified by the experiment of which the agent is considering G as an outcome, or the set of alternative hypotheses under consideration, or some other contextual factor. Thus, we must think of conditional degree of belief
as a function P(A | G, 𝒢) rather than just P(A|G).6 In my dissertation I argue that this relativization of the notion to a further parameter doesn’t cause any problems for the applications of conditional degree of belief in its usual settings.

I do, however, think that the standard account is right to treat the objects of the probability function as something like sets of possible outcomes (in this case, sets of epistemic possibilities, whatever those are). Hájek claims in Section 4.4 that Kolmogorov’s account conflates probability zero events with impossibilities — however, I think Kolmogorov’s account actually preserves this distinction better than some of the alternatives. On his account, events are represented by sets, and their probability is assigned by a function. It’s true that this function assigns zero both to impossible and to non-impossible events, but the distinction shows up in the sets themselves — impossible events are represented by the empty set, while non-impossible events are represented by non-empty sets. In my dissertation, I use this very argument to motivate the claim that at least subjective probabilities are best understood by taking the bearers of probability values to be sets of some sort. This contrasts with accounts like Hailperin’s, on which the events themselves have no internal structure to distinguish impossible from non-impossible ones.7

In summary, subjective probabilities are relations between an agent and a set of epistemic possibilities, rather than a sentence or an event. Although Kolmogorov’s axioms are all appropriate constraints on the relation between conditional and unconditional degrees of belief, the need for conglomerability (to preserve some basic ideas about evidence and independence) means that conditional degrees of belief must actually be relativized to partitions, and must obey Kolmogorov’s more sophisticated account of conditional probability, rather than the simple ratio analysis.

4 For much more on this topic, including more detailed versions of these arguments, and responses to some objections, see my dissertation [Easwaran, 2008]. I also explain in more detail what role I think conditional and unconditional probability play on this interpretation, which is an important part of arguing for any particular formalism.
5 The failure of conglomerability Hájek mentions in Section 8.5, in the two-envelope paradox, depends essentially on the possibility of unboundedly large payoffs. I endorse conglomerability only for probabilities and not for expectations, and since probabilities are never greater than 1, these sorts of cases can’t arise.
6 One might worry that this would cause further problems with our ability to state the Principal Principle. However, the Principal Principle only considers probabilities conditional on specific values of the chance function, which suggests that we should use the partition by values of chances to fill in the third spot in the conditional probability function. Although this might seem not to put many constraints on credences with respect to other partitions, it does (together with conglomerability) entail that an agent’s unconditional credence in an event match her expected value of the unconditional chance, which is most of the work the principle tends to be used for.
7 Popper’s account does better here — although it, like Hailperin’s, has no internal structure to the events, the impossible events can be distinguished as those events B such that for every A, P(A|B) = 1.
1.3 Logical probability
This interpretation is intended to give some sort of generalization of deductive logic’s notion of entailment to a notion of “partial” or “inductive” entailment. When B entails A deductively, P(A|B) = 1; when B entails ¬A, P(A|B) = 0; in other cases, intermediate values are possible. In this particular case, I agree with Hájek’s claim that conditional probability must be the fundamental notion — just as a notion of logical entailment must underlie any notion of logical truth, it looks like a notion of logical conditional probability must underlie any notion of logical probability. Thus, the proper account must again differ from the standard
account mentioned above, in a way different from the previous two interpretations.

Although this might suggest that Popper’s account is the right one for this interpretation, I think this is not totally clear. For one thing, the standard notion of logical consequence is not fundamentally a relation A ⊢ B between two sentences, but rather a relation Γ ⊢ B between a set of sentences Γ and a sentence. To be sure, there is a special case of this relation where Γ contains exactly one sentence. Additionally, when Γ is a finite set, the set-based relation holds between Γ and B iff the sentence-based relation holds between the conjunction of the elements of Γ and B. Thus, if the set is finite, one might try to reduce a function taking a set and a sentence as arguments to a function taking two sentences as arguments.8

8 Even in the finite case, though, this seems to blur some distinctions. After all, the fact that {A, B} ⊢ A ∧ B surely tells us something different from the fact that A ∧ B ⊢ A ∧ B.

In the case of complete entailment (that is, when the conditional probability is 1 or 0), classical first-order logic allows the infinite case to be reduced to the finite case. The compactness theorem states that if Γ is an infinite set and Γ ⊢ B, then there is some finite Γ0 ⊂ Γ such that Γ0 ⊢ B. However, although complete entailment by an infinite set can be reduced to complete entailment by a finite subset, there is no guarantee that this will be the case for partial entailment. The logic of complete entailment is monotonic — if a subset of Γ entails B, then Γ entails B. But if a set of sentences only partially entails a sentence, then surely adding more sentences to the collection of premises can make this partial entailment either better or worse. Thus, there seems to be no hope of reducing the infinite-premise case to the single-premise case in general. If the underlying notion of entailment to be generalized is not classical first-order logic, then the reduction may even be impossible in the case of full entailment.

Thus, if the logical interpretation of probability makes sense (in spite of the many objections that have been raised), it may need to be given by a sort of conditional probability function that allows for conditionalizing on a set of sentences, rather than just a single sentence, as all current formalisms for conditional probability do.
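To see concretely why partial entailment cannot be monotonic, consider one simple precisification in the Wittgenstein–Carnap tradition of counting truth-table rows (used here purely for illustration; the discussion above does not commit to it): measure the degree to which a premise set Γ partially entails B by the proportion of models of Γ in which B holds.

```python
from itertools import product

ATOMS = ("p", "q")

def models(premises):
    """All truth-value assignments to ATOMS satisfying every premise.
    A sentence is represented as a function from an assignment to a bool."""
    rows = [dict(zip(ATOMS, vals))
            for vals in product((True, False), repeat=len(ATOMS))]
    return [row for row in rows if all(s(row) for s in premises)]

def partial_entailment(premises, conclusion):
    """Proportion of models of the premises on which the conclusion is true."""
    ms = models(premises)
    return sum(conclusion(m) for m in ms) / len(ms)

concl = lambda v: v["p"]  # conclusion: the atom p

print(partial_entailment([], concl))                            # 0.5
print(partial_entailment([lambda v: v["p"] or v["q"]], concl))  # 0.666...
print(partial_entailment([lambda v: not v["p"]], concl))        # 0.0
```

Adding the premise p ∨ q raises the degree to which p is partially entailed, while adding ¬p instead destroys it entirely. Full entailment, by contrast, is monotonic: once a subset of the premises entails B, no addition of further premises can undo it.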
2 CONDITIONAL PROBABILITY VS. PROBABILITY GIVEN THE BACKGROUND
Hájek gives several different arguments for the claim that conditional probability ought to be taken as the basic notion of probability theory, rather than unconditional probability, as is standard. Given what I have said above, these arguments will need to be made separately for each interpretation of probability, rather than for all of them simultaneously. But at any rate, I think an important distinction to be made in these arguments is the distinction, in degree of belief terms, between probability conditional on some evidence E and probability given a corpus of background knowledge B. This distinction may be clearer for objective
chances — if P is the probability function giving the chances at time t, then this is the distinction between P(A|E), where E is itself some possible event later in time than t, and P(A|B), where B is the complete description of the history of the universe up to time t. Similar distinctions may apply for some of the other interpretations of probability. To distinguish these, I will call the latter notion “background probability” and reserve the term “conditional probability” for the former. The distinction is approximately that the “background” consists of a collection of information that is necessary for the assignment of probability to even make sense, while a “condition” is some further information that could itself have had a probability given the background. If there really are such sets of background information that are necessary for the assignment of probabilities, then background probabilities seem to play the role that we originally assigned to unconditional probability, and not to conditional probability. There would be no room for a truly “unconditional” probability independent even of a background.

Hájek’s argument at the end of Section 6 that the fundamental notion of probability theory is conditional probability seems to be based on this notion of background probability, rather than the more specific notion of conditional probability that I distinguish from it. Thus, if these two notions require different formalisms, then he has only established that background probabilities are fundamental, and not conditional probabilities.

To see this distinction in the case of the propensity interpretation, consider Hájek’s example of a use for conditional probability: “Probability is a measure of the tendency for a certain kind of experimental set-up to produce a particular outcome . . . [I]t is a conditional probability, the condition being a specification of the experimental set up.” (Section 3.1) This contrasts with the account suggested in [Lewis, 1980], on which the propensity function is indexed to a world and time. On the former point of view, unconditional probabilities don’t even make sense (what’s the probability that a coin lands heads, if nothing about the coin’s history or composition or manner of flipping is definite?), while on the latter point of view, both conditional and unconditional probabilities can be understood. If t, w is the time/world pair at which a coin is about to be flipped, then we can make sense of both Pt,w(H) and Pt,w(H|A), where H is the event of the coin coming up heads, and A is (for example) the event of a substantial amount of matter quantum tunneling away from the coin in mid-air, thus disrupting the rotation of the coin as it flips. Indexing to the time obviates the need to conditionalize on the set-up, or the history, or anything else in determining a probability value.

Thus, there is one picture of the propensity interpretation on which all propensities are conditional on a background (but this is not real conditional probability, in which we might conditionalize on more information than what is necessary to set up the experiment), and another on which there are truly unconditional propensities (though they are indexed to worlds and times, which suffice to determine the background conditions). In neither case do the truly conditional propensities play the fundamental role.

In the case of degree of belief, there are again two options for describing things.
We can index the probability function to an agent at a time, who happens already to have certain knowledge and beliefs. In that case, there are unconditional probabilities (her degrees of belief in particular propositions that she doesn’t already know or fully believe) and conditional probabilities (her degrees of belief in particular propositions when she temporarily supposes further propositions). The other option is to always explicitly include all the agent’s knowledge and information and beliefs in the proposition that is being conditioned on. This is the picture that seems to fit Hájek’s account (citing de Finetti) of all probabilities really being conditional probabilities. But this conflates the things the agent actually knows or believes with things that she is merely supposing for the sake of a conditional probability. Additionally, it invites consideration of probabilities conditional on sets of information that don’t include her full set of knowledge and beliefs. In some cases we can perhaps make sense of these alternate conditional probabilities as expressing counterfactual possibilities of what she imagines she would believe if she didn’t have some of the information she does now. (This would make the conditional probability function conflate three types of attitudes an agent might have to a proposition.) But in cases where we consider a radically impoverished set of information in the condition, it seems that the conditional probability doesn’t really tell us anything meaningful about the agent. At least, the only way around this seems to be to take seriously the notion of a hypothetical prior probability function that the agent would have had in the absence of any information whatsoever.

In the case of logical probability, the distinction I am making between background and truly conditional probabilities doesn’t seem to hold up. There is supposed to be one correct logical probability function, and this function measures degree of entailment between sentences. No background is necessary for assigning these probabilities — all probabilities are really conditional. This is different from the other cases, where it seems that some sort of background is necessary to even begin to assign probabilities (there are no chances without a world and time, there are no degrees of belief absent some agent and time), but where other information might be hypothetically added in a way that gives an interesting (but distinct) notion of conditional probability. For logical probabilities it seems that only the latter exists, but that it really is the basic notion.

This distinction can be challenged in both the propensity and degree of belief cases. But in the propensity case this would involve arguing that there really is some objective chance function that specifies the chances of various events even in the absence of an experimental set-up, including any of the specifications of how the universe works. And in the degree of belief case it would involve arguing that there really are hypothetical priors that have some real meaning for a particular agent. The difficulty of arguing for such unified functions that can assign probabilities in the absence of information is one of the serious challenges that has been raised against the logical interpretation of probability. Instead, it seems more promising to me to just index chances to worlds and times, and index degrees of belief to agents and times, so that the background doesn’t have to be explicitly mentioned in
every probability statement. But this just means that in these cases sense can be made of a notion of unconditional probability that is not merely derivative from conditional probability, and that some arguments suggesting a priority for conditional probability identify the background information of the situation with the actual objects of the conditional probability function. Thus, although it may be the case that all non-logical interpretations of probability depend on some set of information in order to assign probabilities in particular cases, the role this information plays can be very different from the role that the antecedent of a conditional probability plays in that interpretation. Hence, this background information is not a good argument for the claim that conditional probabilities are more fundamental than unconditional ones.

BIBLIOGRAPHY

[Easwaran, 2008] K. Easwaran. The Foundations of Conditional Probability. PhD thesis, University of California, Berkeley, 2008.
[Eells, 1983] E. Eells. Objective probability theory theory. Synthese, 57:387–442, 1983.
[Fara, 2006] M. Fara. Dispositions. Stanford Encyclopedia of Philosophy, 2006.
[Gillies, 2000] D. Gillies. Varieties of propensity. British Journal for the Philosophy of Science, 51:807–835, 2000.
[Goel and Zellner, 1986] P. K. Goel and A. Zellner, eds. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland, 1986.
[Hailperin, 1984] T. Hailperin. Probability logic. Notre Dame Journal of Formal Logic, 25(3):198–212, 1984.
[Hailperin, 1997] T. Hailperin. Ontologically neutral logic. History and Philosophy of Logic, 18:185–200, 1997.
[Hailperin, 2000] T. Hailperin. Probability semantics for quantifier logic. Journal of Philosophical Logic, 29:207–239, 2000.
[Hailperin, 2006] T. Hailperin. Probability logic and combining evidence. History and Philosophy of Logic, 27:249–269, 2006.
[Humphreys, 1985] P. Humphreys. Why propensities cannot be probabilities. The Philosophical Review, 94(4):557–570, 1985.
[Humphreys, 2004] P. Humphreys. Some considerations on conditional chances. British Journal for the Philosophy of Science, 55:667–680, 2004.
[Joyce, 1999] J. Joyce. The Foundations of Causal Decision Theory. Cambridge University Press, 1999.
[Kadane et al., 1986] J. B. Kadane, M. J. Schervish, and T. Seidenfeld. Statistical implications of finitely additive probability. In P. K. Goel and A. Zellner, eds., Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland, 1986.
[Kolmogorov, 1950] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, 1950.
[Lange, 2006] M. Lange. Do chances receive equal treatment under the laws? Or: Must chances be probabilities? British Journal for the Philosophy of Science, 57:383–403, 2006.
[Lewis, 1980] D. Lewis. A subjectivist’s guide to objective chance. In R. Jeffrey, ed., Studies in Inductive Logic and Probability, volume II. University of California Press, 1980.
[Lewis, 1994] D. Lewis. Humean supervenience debugged. Mind, 103(412):473–490, 1994.
[McCurdy, 1996] C. McCurdy. Humphreys’s paradox and the interpretation of inverse conditional probabilities. Synthese, 108:105–125, 1996.
[Meek and Glymour, 1994] C. Meek and C. Glymour. Conditioning and intervening. British Journal for the Philosophy of Science, 45:1001–1021, 1994.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery, chapter iv*, pages 326–348. Harper & Row, 1959.
[Rényi, 1970] A. Rényi. Foundations of Probability. Holden-Day, 1970.
Part III
Four Paradigms of Statistics
Classical Statistics Paradigm
ERROR STATISTICS

Deborah G. Mayo and Aris Spanos
1 WHAT IS ERROR STATISTICS?
Error statistics, as we are using that term, has a dual dimension involving philosophy and methodology. It refers to a standpoint regarding both:

1. a cluster of statistical tools, their interpretation and justification, and
2. a general philosophy of science, and the roles probability plays in inductive inference.

Adequately appraising the error statistical approach, and comparing it to other philosophies of statistics, requires understanding the complex interconnections between the methodological and philosophical dimensions in (1) and (2) respectively. To make this entry useful while keeping it to a manageable length, we restrict our main focus to (1) the error statistical philosophy. We will, however, aim to bring out enough of the interplay between the philosophical, methodological, and statistical issues to elucidate long-standing conceptual, technical, and epistemological debates surrounding both these dimensions.

Even with this restriction, we are identifying a huge territory marked by generations of recurring controversy about how to specify and interpret statistical methods. Standard explications of statistical methods focus on the formal mathematical tools without considering a general philosophy of science or of induction in which these tools best fit. This is understandable for two main reasons: first, it is not the job of statisticians to delve into philosophical foundations, at least explicitly. The second and deeper reason is that the philosophy of science and induction for which these tools are most apt — we may call it the error statistical philosophy — will differ in central ways from traditional conceptions of the scientific method. These differences may be located in contrasting answers to a fundamental pair of questions that are of interest to statisticians and philosophers of science:

• How do we obtain reliable knowledge about the world despite uncertainty and threats of error?
• What is the role of probability in making reliable inferences?
To zero in on the main issues, we will adopt the following, somewhat unusual strategy: we shall first set out some of the main ideas within the broader error statistical philosophy of induction and inquiry, identifying the goals and requirements that serve both to direct and to justify the use of formal statistical tools. Next we turn to explicating statistical methods of testing and estimation, while simultaneously highlighting classic misunderstandings and fallacies. The error statistical account we advocate builds on Fisherian and Neyman-Pearsonian methods; see [Fisher, 1925; 1956; Neyman, 1952; Pearson, 1962]. While we wish to set out for the reader the basic elements of these tools, as more usually formulated, we will gradually transform them into something rather different from both the Fisherian and Neyman-Pearsonian paradigms. Our goals are twofold: to set the stage for developing a more adequate philosophy of inductive inquiry, and to illuminate crucial issues for making progress on the “statistical wars”, now in their seventh decade.1
1.1 The Error Statistical Philosophy
Under the umbrella of error-statistical methods, one may include all standard methods using error probabilities based on the relative frequencies of errors in repeated sampling — often called sampling theory or frequentist statistics. Frequentist statistical methods are sometimes erroneously equated to other accounts that employ frequentist probability, for example, the “frequentism” in the logic of confirmation. The latter has to do with using relative frequencies of occurrences to infer probabilities of events, often by the straight rule, e.g., from an observed proportion of As that are Bs to infer the proportion of As that are Bs in a population.

1. One central difference, as Neyman [1957] chided Carnap, is that, unlike frequentist logics of confirmation, frequentist statistics always addresses questions or problems within a statistical model2 (or family of models) M intended to provide an approximate (and idealized) representation of the process generating the data x0 := (x1, x2, . . . , xn). M is defined in terms of f(x; θ), the probability distribution of the sample X := (X1, . . . , Xn), which assigns probabilities to all events of interest belonging to the sample space R^n_X. Formal error statistical methods encompass the deductive assignments of probabilities to outcomes, given a statistical model M of the experiment, and inductive methods from the sample to claims about the model. Statistical inference focuses on the latter step: moving from the data to statistical hypotheses, typically couched in terms of the unknown parameter(s) θ that govern f(x; θ).
1 The ‘first act’ might be traced to the papers by Fisher [1955], Pearson [1955], Neyman [1956].
2 Overlooking the necessity of clearly specifying the statistical model — in terms of a complete set of probabilistic assumptions — is one of the cardinal sins still committed, especially by nonstatisticians, in expositions of frequentist statistics.
2. The second key difference is how probability arises in induction. For the error statistician, probability arises not to measure degrees of confirmation or belief (actual or rational) in hypotheses, but to quantify how frequently methods are capable of discriminating between alternative hypotheses and how reliably they facilitate the detection of error. These probabilistic properties of inductive procedures are error frequencies or error probabilities.

The statistical methods of significance tests and confidence-interval estimation are examples of formal error-statistical methods. A statistical inference might be an assertion about the value of θ, say that θ > 0. Error probabilities attach, not directly to θ > 0, but to the inference tools themselves, whether tests or estimators. The claims concerning θ are either correct or incorrect as regards the mechanism generating the data. Insofar as we are interested in using data to make inferences about this mechanism, in this world, it would make no sense to speak of the relative frequency of θ > 0, as ‘if universes were as plenty as blackberries from which we randomly selected this one universe’, as Peirce would say (2.684). Nevertheless, error probabilities are the basis for determining whether and how well a statistical hypothesis such as θ > 0 is warranted by the data x0 at hand, and for setting bounds on how far off parameter values can be from 0 or other hypothetical values. Since it is the use of frequentist error probabilities, and not merely the use of frequentist probability, that is central to this account, the term error statistics (an abbreviation of error probability statistics) seems an apt characterization.

Statistical Significance Test

Formally speaking, the inductive move in error statistics occurs by linking special functions of the data, d(X), known as statistics, to hypotheses about the parameter(s) θ of interest. For example, a test might be given as a rule: whenever d(X) exceeds some constant c, infer θ > 0, thereby rejecting θ = 0:

Test Rule: whenever {d(x0) > c}, infer θ > 0.

Any particular application of an inductive rule can be ‘in error’ in what it infers about the data generation mechanism, so long as data are limited. If we could calculate the probability of the event {d(X) > c} under the assumption that θ = 0, we could calculate the probability of erroneously inferring θ > 0. Error probabilities are computed from the distribution of d(X), the sampling distribution, evaluated under various hypothesized values of θ. The genius of formal error statistics is its ability to provide inferential tools where the error probabilities of interest may be calculated, despite unknowns.

Consider a simple and canonical example of a statistical test, often called a statistical significance test. Such a test, in the context of a statistical model M, is a procedure with the following components:

1. a null hypothesis H0, couched in terms of the unknown parameter θ, and
2. a function of the sample, d(X), the test statistic, which reflects how well or poorly the data x0 accord with the null hypothesis H0 — the larger the value of d(x0), the further the outcome is from what is expected under H0 — with respect to the particular question being asked.

A crucial aspect of an error statistical test is its ability to ensure that the sampling distribution of the test statistic can be computed under H0 and under hypotheses discrepant from H0. In particular, this allows computing:

3. the significance level associated with d(x0): the probability of a worse fit with H0 than the observed d(x0), under the assumption that H0 is true:

p(x0) = P(d(X) > d(x0); H0).

This is known either as the observed significance level or the p-value. The larger the value of the test statistic, the smaller the p-value.

Identifying a relevant test statistic, together with the need to calculate the error probabilities, restricts the choices of test statistic so as to lead to a uniquely appropriate test, whether one begins with Neyman-Pearson (N-P) or Fisherian objectives [Cox, 1958]. Consider, for example, the case of a random sample X of size n from a Normal distribution with unknown mean µ and, for simplicity, known variance σ² (denoted by N(µ, σ²)). We want to test the hypotheses:

(1) H0: µ = µ0 vs. H1: µ > µ0.

The test statistic of this one-sided test is d(X) = (X̄ − µ0)/σx, where X̄ = (X1 + X2 + · · · + Xn)/n denotes the sample mean and σx = σ/√n. Given a particular outcome x0, we compute d(x0). An inductive or ampliative inference only arises in moving from d(x0) — a particular outcome — to a hypothesis about the parameter µ. Consider the test rule: whenever X̄ exceeds µ0 by 1.96σx or more, infer H1: µ > µ0. Use of the statistic d(X) lets us write this test rule more simply:

Test Rule T: whenever {d(x0) > 1.96}, infer H1: µ > µ0.

We deductively arrive at the probability of the event {d(X) > 1.96} under the assumption that H0 correctly describes the data generating process, namely P(d(X) > 1.96; H0) = .025, giving the statistical significance level .025.

Tests, strictly speaking, are formal mapping rules. To construe them as inference tools requires an interpretation beyond the pure formalism, and this can be done in various ways. For instance, a Fisherian may simply output “the observed significance level is .025”; a Neyman-Pearson tester might report “reject H0”, having decided in advance that any outcome reaching the .025 significance level will lead to this output. But these reports themselves require fleshing out, and a good deal of controversy revolves around some of the familiar ways of doing so.
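The computations in this example can be checked directly. The minimal Python sketch below (the data summary x̄ = 0.8, σ = 2, n = 25 is invented for illustration) evaluates the test statistic d(x0) and the observed significance level P(d(X) > d(x0); H0) using the standard Normal distribution.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """The standard Normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Invented numbers for illustration: sigma known, n observations, observed mean.
mu0, sigma, n = 0.0, 2.0, 25
xbar = 0.8                        # observed sample mean
sigma_x = sigma / sqrt(n)         # sigma_x = sigma / sqrt(n) = 0.4

d_x0 = (xbar - mu0) / sigma_x     # test statistic d(x0) = 2.0
p_value = 1.0 - std_normal_cdf(d_x0)   # P(d(X) > d(x0); H0)

print(d_x0, round(p_value, 4))    # 2.0 0.0228: below .025, so Test Rule T infers mu > mu0
print(round(1.0 - std_normal_cdf(1.96), 4))  # 0.025, the significance level of Test Rule T
```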
Behavioristic and evidential philosophies

By a “statistical philosophy” we understand a general conception of the aims and epistemological foundations of a statistical methodology. Thus an important task for an error statistical philosophy is to articulate the interpretation and the rationale for the outputs of inductive tests. For instance, continuing with our example, a Fisherian might declare that whenever {d(x0) > 1.96}: “infer that H0 is falsified at level .025”. A strict Neyman-Pearsonian might identify ‘reject H0’ with a specific action, such as: “publish the result” or “reject the shipment of bolts”. Consider a third construal: whenever {d(x0) > 1.96}, infer that the data x0 are evidence for a (positive) discrepancy γ from µ0: “infer x0 indicates µ > µ0 + γ”.

The weakest claim is to infer some discrepancy — as with the typical non-null hypothesis. More informatively, as we propose, one might specify a particular positive value for γ. For each we can obtain error probabilistic assertions: assuming that H0 is true, the probability is .025 that:

• H0 is falsified (at this level),
• the shipment of bolts is rejected,
• some positive discrepancy γ from µ0 is inferred.

Each of these interpretations demands a justification or rationale, and it is a crucial part of the corresponding statistical philosophy to supply it. Error statistical methods are typically associated with two distinct justifications:

Behavioristic rationale. The first stresses the ability of tests to control error probabilities at some low level in the long run. This goal accords well with what is generally regarded as Neyman’s statistical philosophy, wherein tests are interpreted as tools for deciding “how to behave” in relation to the phenomena under test, and are justified in terms of their ability to ensure low long-run errors. Neyman [1971] even called his tests tools for inductive behavior, to underscore the idea that the test output was an action, as well as to draw the contrast with Bayesian inductive inference in terms of degrees of belief.
Inferential rationale. A non-behavioristic or inferential justification stresses the relevance of error probabilities for achieving inferential and learning goals. To succeed, it must show how error probabilities can be used to characterize warranted inference in some sense.

The key difference between behavioristic and inferential construals of tests is not whether one views an inference as a kind of decision (which one is free to do), but rather the justificatory role given to error probabilities. On the behavioristic philosophy, the goal is to adjust our behavior so that in the long run we will not act erroneously too often: it regards low long-run error rates (ideally, optimal ones) alone as what justifies a method. This does not yield a satisfactory error statistical philosophy in the context of scientific inference. How to provide an inferential philosophy for error statistics has been the locus of the most philosophically interesting controversies. Although our main focus will be on developing an adequate inferential construal, there are contexts wherein the more behavioristic construal is entirely appropriate, and we propose to retain it within the error statistical umbrella.3 When we speak of “the context of scientific inference” we refer to a setting where the goal is an inference about what is the case regarding a particular phenomenon.

3 Even in science there are tasks whose goal is avoiding too much noise in the network.

Objectivity in error statistics

Underlying the error statistical philosophy, as we see it, is a conception of the objective underpinnings for uncertain inference: although knowledge gaps leave plenty of room for biases, arbitrariness, and wishful thinking, in fact we regularly come up against experiences that thwart our expectations and disagree with the predictions and theories we try to foist upon the world, and this affords objective constraints on which our critical capacity is built. Getting it (at least approximately) right, and not merely ensuring internal consistency or agreed-upon convention, is at the heart of objectively orienting ourselves toward the world. Our ability to recognize when data fail to match anticipations is what affords us the opportunity to systematically improve our orientation in direct response to such disharmony.

Much as Popper [1959] takes the ability to falsify as the foundation of objective knowledge, R. A. Fisher [1935, p. 16] developed statistical significance tests based on his view that “every experiment may be said to exist only in order to give the facts the chance of disproving the null hypothesis”. Such failures could always be avoided by “immunizing” a hypothesis against criticism, but to do so would prevent learning what is the case about the phenomenon in question, and thus flies in the face of the aim of objective science. Such a strategy would have a very high probability of saving false claims.

However, what we are calling the error-statistical philosophy goes beyond falsificationism, of both the Popperian and Fisherian varieties, most notably in its consideration of what positive inferences are licensed when data do not falsify but rather accord with a hypothesis or claim. Failing to falsify hypotheses, while rarely allowing their acceptance as precisely true, may warrant excluding various discrepancies, errors or rivals. Which ones?
Those which, with high probability, would have led the test to issue a more discordant outcome, or a more statistically significant result. In those cases we may infer that the discrepancies, rivals, or errors are ruled out with severity.

Philosophy should direct methodology (not the other way around)

To implement the error statistical philosophy requires methods that can accomplish the goals it sets for uncertain inference in science. This requires tools that pay explicit attention to the need to communicate results so as to set the stage for others to check, debate, scrutinize and extend the inferences reached. Thus, any adequate statistical methodology must provide the means to address legitimate critical questions, to give information as to which conclusions are likely to stand up to further probing, and where weaknesses remain. The much maligned, automatic, recipe-like uses of N-P tests, wherein one accepts and rejects claims according to whether they fall into prespecified ‘rejection regions’, are uses we would also condemn. Rather than spend time legislating against such tests, we set out principles of interpretation that automatically scrutinize any inferences based on them. (Even silly tests can warrant certain claims.) This is an important source of objectivity that is open to the error statistician: the choice of test may be a product of subjective whims, but the ability to critically evaluate which inferences are and are not warranted is not.

Background knowledge in the error statistical framework of ‘active’ inquiry

The error statistical philosophy conceives of statistics very broadly to include the conglomeration of systematic tools for collecting, modeling and drawing inferences from data, including purely ‘data analytic’ methods that are normally not deemed ‘inferential’. In order for formal error statistical tools to link data, or data models, to primary scientific hypotheses, several different statistical hypotheses may be called upon, each permitting an aspect of the primary problem to be expressed and probed. An auxiliary or ‘secondary’ set of hypotheses is needed to check the assumptions of other models in the complex network; see section 4. Its ability to check its own assumptions is another important ingredient in the objectivity of this approach.

There is often a peculiar allegation (criticism) that:
(#1) error statistical tools forbid using any background knowledge,
as if one must start each inquiry with a blank slate. This allegation overlooks the huge methodology of experimental design, data analysis, model specification, and work directed at linking substantive and statistical parameters. A main reason for this charge is that prior probability assignments in hypotheses do not enter into the calculations (except in very special cases). But there is no reason to suppose that the kind of background information we need in order to specify and interpret statistical methods can or should be captured by prior probabilities in the hypotheses being studied. (We return to this in section 3.) But background knowledge must enter
in designing, interpreting, and combining statistical inferences in both informal and semi-formal ways. Far from wishing to inject our background opinions into the hypotheses being studied, we seek designs that help us avoid being misled or biased by initial beliefs. Although we cannot fully formalize them, we can systematize the manifold steps and interrelated checks that, taken together, constitute a full-bodied experimental inquiry that is realistic.

The error statistician is concerned with the critical control of scientific inferences by means of stringent probes of conjectured flaws and sources of unreliability. Standard statistical hypotheses, while seeming oversimple in and of themselves, are highly flexible and effective for the piece-meal probes the error statistician seeks. Statistical hypotheses offer ways to couch conjectured flaws in inference, such as:

• mistaking spurious for genuine correlations,
• mistaken directions of effects,
• mistaken values of parameters,
• mistakes about causal factors,
• mistakes about assumptions of statistical models.

The qualities we look for to express and test hypotheses about such inference errors are generally quite distinct from those required of the substantive scientific claims about which we use statistical tests to learn. Unless the overall error statistical philosophy is recognized, the applicability and relevance of the formal methodology will be misunderstood, as it often is. Although the overarching goal of inquiry is to find out what is (truly) the case about aspects of phenomena, the hypotheses erected in the actual processes of finding things out are generally approximations (idealizations) and may even be deliberately false. The picture corresponding to error statistics is one of an activist learner in the midst of an inquiry with the goal of finding something out. We want hypotheses that will allow for stringent testing so that, if they pass, we have evidence of a genuine experimental effect. The goal of attaining such well-probed hypotheses differs crucially from seeking highly probable ones (however probability is interpreted). We will say more about this in section 3.
1.2 An Error Statistical Philosophy of Science
The error statistical philosophy just sketched alludes to the general methodological principles and foundations associated with frequentist error statistical methods. By an error statistical philosophy of science, on the other hand, we have in mind the application of those tools and their interpretation to problems of philosophy of science: to model scientific inference (actual or rational), to scrutinize principles of inference (e.g., prefer novel results, varying data), and to frame and tackle philosophical problems about evidence and inference (how to warrant data, pinpoint blame for anomalies, test models and theories). Nevertheless, each of the points
in 1.1 about statistical methodology has direct outgrowths for the philosophy of science dimension. The outgrowths yield: (i) requirements for an adequate philosophy of evidence and inference, but also (ii) payoffs for using statistical science to make progress on philosophical problems.
(i) In order to obtain a philosophical account of inference from the error statistical perspective, one would require forward-looking tools for finding things out, not for reconstructing inferences as ‘rational’ (in accordance with one or another view of rationality). An adequate philosophy of evidence would have to engage statistical methods for obtaining, debating, rejecting, and affirming data. From this perspective, an account of scientific method that begins its work only once well-defined evidence claims are available forfeits the ability to be relevant to understanding the actual processes behind the success of science.
(ii) Conversely, it is precisely because the contexts in which statistical methods are most needed are ones that compel us to be most aware of the strategies scientists use to cope with threats to reliability, that considering the nature of statistical method in the collection, modeling, and analysis of data is so effective a way to articulate and warrant principles of evidence.
In addition to paving the way for richer and more realistic philosophies of science, we claim, examining error statistical methods sets the stage for solving or making progress on long-standing philosophical problems about evidence and inductive inference. • Where the recognition that data are always fallible presents a challenge to traditional empiricist foundations, the cornerstone of statistical induction is the ability to move from less to more accurate data. • Where the best often thought feasible is getting it right in some asymptotic long-run, error statistical methods ensure specific precision in finite samples, and supply ways to calculate how large a sample size n needs to be. • Where pinpointing blame for anomalies is thought to present insoluble Duhemian problems and underdetermination, a central feature of error statistical tests is their capacity to evaluate error probabilities that hold regardless of unknown background or nuisance parameters. • Where appeals to statistics in conducting a meta-methodology too often boil down to reconstructing one’s intuition in probabilistic terms, statistical principles of inference do real work for us — in distinguishing when and why violations of novelty matter, when and why irrelevant conjuncts are poorly supported, and so on.
Although the extended discussion of an error statistical philosophy of science goes beyond the scope of this paper (but see [Mayo, 1996; Mayo and Spanos, 2010]), our discussion should show the relevance of problems in statistical philosophy for addressing the issues in philosophy of science — which is why philosophy of statistics is so rich a resource for epistemologists. In the next section we turn to the central error statistical principle that links (1.1) the error statistical philosophy and (1.2) an error statistical philosophy of science.
1.3 The Severity Principle
A method's error probabilities describe its performance characteristics in a hypothetical sequence of repetitions. How are we to use error probabilities in making particular inferences? This leads to the general question: When do data x0 provide good evidence for, or a good test of, hypothesis H?

Our standpoint begins with the situation in which we would intuitively deny x0 is evidence for H. Data x0 fail to provide good evidence for the truth of H if the inferential procedure had very little chance of providing evidence against H, even if H is false.

Severity Principle (weak). Data x0 (produced by process G) do not provide good evidence for hypothesis H if x0 results from a test procedure with a very low probability or capacity of having uncovered the falsity of H, even if H is incorrect.

Such a test we would say is insufficiently stringent or severe. The onus is on the person claiming to have evidence for H to show that they are not guilty of at least so egregious a lack of severity. Formal error statistical tools are regarded as providing systematic ways to foster this goal, as well as to determine how well it has been met in any specific case. Although one might stop with this negative conception (as perhaps Fisher and Popper did), we will go on to the further, positive one, which will comprise the full severity principle:

Severity Principle (full). Data x0 (produced by process G) provide good evidence for hypothesis H (just) to the extent that test T severely passes H with x0.

Severity rationale vs. low long-run error-rate rationale (evidential vs. behavioral rationale)

Let us begin with a very informal example. Suppose we are testing whether and how much weight George has gained between now and the time he left for Paris, and do so by checking if any difference shows up on a series of well-calibrated and stable weighing methods, both before his leaving and upon his return. If no change
on any of these scales is registered, even though, say, they easily detect a difference when he lifts a .1-pound potato, then this may be regarded as grounds for inferring that George's weight gain is negligible within limits set by the sensitivity of the scales. The hypothesis H here might be:

H: George's weight gain is no greater than δ,

where δ is an amount easily detected by these scales. H, we would say, has passed a severe test: were George to have gained δ pounds or more (i.e., were H false), then this method would almost certainly have detected this.

A behavioristic rationale might go as follows: if one always follows the rule going from failure to detect a weight gain after stringent probing to inferring weight gain no greater than δ, then one would rarely be wrong in the long run of repetitions. While true, this is not the rationale we give in making inferences about George. It is rather that this particular weighing experiment indicates something about George's weight. The long run properties — at least when they are relevant for particular inferences — utilize error probabilities to characterize the capacity of our inferential tool for finding things out in the particular case. This is the severity rationale.

We wish to distinguish the severity rationale from a more prevalent idea for how procedures with low error probabilities become relevant to a particular application; namely, the procedure is rarely wrong, therefore, the probability it is wrong in this case is low. In this view we are justified in inferring H because it was the output of a method that rarely errs. This justification might be seen as intermediate between full-blown behavioristic justifications and a genuine inferential justification. We may describe this as the notion that the long run error probability 'rubs off' on each application. This still does not get at the reasoning for the particular case at hand. The reliability of the rule used to infer H is at most a necessary and not a sufficient condition to warrant inferring H.

What we wish to sustain is this kind of counterfactual statistical claim: that were George to have gained more than δ pounds, at least one of the scales would have registered an increase. This is an example of what philosophers often call an argument from coincidence: it would be a preposterous coincidence if all the scales easily registered even slight weight shifts when weighing objects of known weight, and yet were systematically misleading us when applied to an object of unknown weight. Are we to allow that tools read our minds just when we do not know the weight? To deny the warrant for H, in other words, is to follow a highly unreliable method: it would erroneously reject correct inferences with high or maximal probability (minimal severity), and thus would thwart learning. The stronger, positive side of the severity principle is tantamount to espousing the legitimacy of strong arguments from coincidence. What statistical tests enable us to do is determine when such arguments from coincidence are sustainable (e.g., by setting up null hypotheses). It requires being very specific about which inference is thereby warranted — we may, for example, argue from coincidence for a genuine, non-spurious effect, but not be able to sustain an argument to the truth of a theory or even the reality of an entity.
Passing a Severe Test. We can encapsulate this as follows:

A hypothesis H passes a severe test T with data x0 if,
(S-1) x0 accords with H (for a suitable notion of accordance), and
(S-2) with very high probability, test T would have produced a result that accords less well with H than x0 does, if H were false or incorrect.

Equivalently, (S-2) can be stated:

(S-2)*: with very low probability, test T would have produced a result that accords as well as or better with H than x0 does, if H were false or incorrect.

Severity, in our conception, somewhat in contrast to how it is often used, is not a characteristic of a test in and of itself, but rather of the test T, a specific test result x0, and a specific inference H (not necessarily predesignated) being entertained. That is, the severity function has three arguments. We use the notation SEV(T, x0, H), or even SEV(H), to abbreviate: "The severity with which claim H passes test T with outcome x0".

As we will see, the analyses may take different forms: one may provide a series of inferences that pass with high and low severity, serving essentially as benchmarks for interpretation, or one may fix the inference of interest and report the severity attained. The formal statistical testing apparatus does not include severity assessments, but there are ways to use the error statistical properties of tests, together with the outcome x0, to evaluate a test's severity in relation to an inference of interest. This is the key for the inferential interpretation of error statistical tests. While, at first blush, a test's severity resembles the notion of a test's power, the two notions are importantly different; see section 2.

The severity principle, we hold, makes sense of the underlying reasoning of tests, and addresses chronic problems and fallacies associated with frequentist testing. In developing this account, we draw upon other attempts to supply frequentist foundations, in particular by Bartlett, Barnard, Birnbaum, Cox, Efron, Fisher, Lehmann, Neyman, and E. Pearson; the severity notion, or something like it, affords a rationale and unification of several threads that we have extracted and woven together. Although mixing aspects from N-P and Fisherian tests is often charged as being guilty of an inconsistent hybrid [Gigerenzer, 1993], the error statistical umbrella, linked by the notion of severity, allows for a coherent blending of elements from both approaches. The different methods can be understood as relevant for one or another type of question along the stages of a full-bodied inquiry. Within the error statistical umbrella, the different methods are part of the panoply of methods that may be used in the service of severely probing hypotheses.
A principle for interpreting statistical inference vs. the goal of science

We should emphasize at the outset that while severity is the principle on which interpretations of statistical inferences are based, we are not claiming it is the goal of science. While scientists seek to have hypotheses and theories pass severe tests, severity must be balanced with informativeness. So for example, trivially true claims would pass with maximal severity, but they would not yield informative inferences.4 Moreover, one learns quite a lot from ascertaining which aspects of theories have not yet passed severely. It is the basis for constructing rival theories which existing tests cannot distinguish, and is the impetus for developing more probative tests to discriminate them (see [Mayo, 2010a]).
2 A PHILOSOPHY FOR ERROR STATISTICS
We review the key components of error statistical tests, set out the core ingredients of both Fisherian and N-P tests, and then consider how the severity principle directs the interpretation of frequentist tests. We are then in a position to swiftly deal with the specific criticisms lodged at tests.
2.1 Key Components of Error-Statistical Tests
While we focus, for simplicity, on inferences relating to the simple normal model defined in section 1.1, the discussion applies to any well-defined frequentist test.

A One-Sided Test Tα. We elaborate on the earlier example in order to make the severity interpretation of tests concrete.

EXAMPLE 1. Test Tα. Consider a sample X := (X1, ..., Xn) of size n, where each Xk is assumed to be Normal (N(µ, σ²)), Independent and Identically Distributed (NIID), denoted by:

M: Xk ∽ NIID(µ, σ²), −∞ < µ < ∞, σ² > 0, k = 1, 2, ..., n,

where, at first, σ² is assumed known. The hypotheses of interest are:

H0: µ = µ0 vs. H1: µ > µ0.

The test statistic is d(X) = (X̄ − µ0)/σx, where X̄ denotes the sample mean of the Xk's and σx = (σ/√n). Under H0, d(X) ∽ N(0, 1), so the p-value of an observed d(x0) is p(x0) = P(d(X) > d(x0); H0), and the cut-off cα for rejection at significance level α satisfies P(d(X) > cα; H0) = α. For instance, cα for α = .025 is 1.96; see figures 1(a)-(d). We also have: P(observing a p-value ≤ α) ≤ α.

In the simple Fisherian test, the p-value indicates the level of inconsistency between what is expected and what is observed, in the sense that the smaller the p-value the larger the discordance between x0 and H0 [Cox, 1958]. If the p-value is not small, i.e., if it is larger than some threshold α (e.g., .01), then the disagreement is not considered strong enough to indicate evidence of departures from H0. Such a result is commonly said to be insignificantly different from H0, but, as we will see, it is fallacious to automatically view it as evidence for H0. If the p-value is small enough, the data are regarded as grounds to reject or find a discrepancy from the null. Evidence against the null suggests evidence for some discrepancy from the null, although it is not made explicit in a simple Fisherian test.

Reference to 'discrepancies from the null hypothesis' leads naturally into Neyman-Pearson [1933] territory. Here, the falsity of H0 is defined as H1, the complement of H0 with respect to the parameter space Θ. In terms of the p-value, the Neyman-Pearson (N-P) test may be given as a rule: if p(x0) ≤ α, reject H0 (infer H1); if p(x0) > α, do not reject H0. Equivalently, the test fixes cα at the start as the cut-off point such that any outcome smaller than cα is taken to "accept" H0.

Critics often lampoon an automatic-recipe version of these tests. Here the tester is envisioned as simply declaring whether or not the result was statistically significant at a fixed level α, or equivalently, whether the data fell in the rejection region. Attention to the manner in which tests are used, even by Neyman, however, reveals a much more nuanced and inferential interpretation to which these formal test rules are open. These uses (especially in the work of E. Pearson) provide a half-way house toward an adequate inferential interpretation of tests:

Accept H0: statistically insignificant result — "decide" (on the basis of the observed p-value) that there is insufficient evidence to infer departure from H0, and

Reject H0: statistically significant result — "decide" (on the basis of the observed p-value) that there is some evidence of the falsity of H0 in the direction of the alternative H1.
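To fix ideas, here is a minimal numerical sketch of test Tα (our illustration, not part of the original exposition; Python with numpy and scipy is assumed, and the simulated sample is purely hypothetical):

```python
# Sketch of the one-sided test T_alpha: H0: mu = mu0 vs. H1: mu > mu0,
# with sigma known. Data and function names are illustrative only.
import numpy as np
from scipy.stats import norm

def test_T_alpha(x, mu0=0.0, sigma=2.0, alpha=0.025):
    n = len(x)
    sigma_xbar = sigma / np.sqrt(n)          # standard deviation of the sample mean
    d = (np.mean(x) - mu0) / sigma_xbar      # test statistic d(x0)
    p = 1 - norm.cdf(d)                      # p-value: P(d(X) > d(x0); H0)
    c_alpha = norm.ppf(1 - alpha)            # cut-off c_alpha (1.96 for alpha = .025)
    return d, p, d > c_alpha

rng = np.random.default_rng(0)
x0 = rng.normal(loc=0.3, scale=2.0, size=100)   # hypothetical sample
d, p, reject = test_T_alpha(x0)
print(f"d(x0) = {d:.2f}, p-value = {p:.3f}, reject H0: {reject}")
```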
Although one could view these as decisions, we wish to interpret them as inferences. All the N-P results would continue to hold with either construal.

The N-P test rule: Reject H0 iff d(x0) > cα, ensures the probability of rejecting (i.e., declaring there is evidence against) H0 when H0 is true — a type I error — is α. Having fixed α, the key idea behind N-P tests is to minimize the probability of a type II error (failing to reject H0 when H1 is true), written as β(µ1):

P(d(X) ≤ cα; µ = µ1) = β(µ1), for any µ1 greater than µ0.

That is, the test minimizes the probability of finding "statistical agreement" with H0 when some alternative hypothesis H1 is true. Note that the set of alternatives in this case includes all µ1 > µ0, i.e., H1 is a composite hypothesis, hence the notation β(µ1). Equivalently, the goal is to maximize the power of the test, for a fixed cα:

POW(Tα; µ1) = P(d(X) > cα; µ1), for any µ1 greater than µ0.

In the behavioristic construal of N-P tests, these goals are put in terms of wishing to avoid behaving erroneously too often. But the tests that grow out of the requirement to satisfy the N-P pre-data, long-run desiderata often lead to a uniquely appropriate test, whose error probabilities simultaneously can be shown to satisfy severity desiderata.

The severity construal of N-P tests underscores the role of error probabilities as measuring the 'capacity' of the test to detect different discrepancies γ ≥ 0 from the null, where µ1 = (µ0 + γ). The power of a 'good' test is expected to increase with the value of γ. Pre-data, these desiderata allow us to ensure two things: (i) a rejection indicates with severity some discrepancy from the null, and (ii) failing to reject the null rules out with severity those alternatives against which the test has high power. Post-data, one can go much further in determining the magnitude γ of discrepancies from the null warranted by the actual data in hand. That will be the linchpin of our error statistical construal.

Still, even N-P practitioners often prefer to report the observed p-value rather than merely whether the predesignated cut-off for rejection has been reached, because it "enables others to reach a verdict based on the significance level of their choice" [Lehmann, 1993, p. 62]. What will be new in the severity construal is considering sensitivity in terms of the probability of {d(X) > d(x0)}, under various alternatives to the null, rather than the N-P focus on {d(X) > cα}. That is, the error statistical construal of tests will require evaluating this 'sensitivity' post-data (relative to d(x0), not cα); see [Cox, 2006]. We now turn to the task of articulating the error-statistical construal of tests by considering, and responding to, classic misunderstandings and fallacies.
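Before turning to those fallacies, the power function just defined is easy to make concrete; the following is a minimal sketch (ours, with illustrative parameter values; scipy assumed):

```python
# Sketch of POW(T_alpha; mu1) = P(d(X) > c_alpha; mu = mu1) for the simple
# Normal model with sigma known; parameter values are illustrative.
import numpy as np
from scipy.stats import norm

def power(mu1, mu0=0.0, sigma=2.0, n=100, alpha=0.025):
    sigma_xbar = sigma / np.sqrt(n)
    c_alpha = norm.ppf(1 - alpha)                      # 1.96 for alpha = .025
    # under mu = mu1, d(X) ~ N((mu1 - mu0)/sigma_xbar, 1)
    return 1 - norm.cdf(c_alpha - (mu1 - mu0) / sigma_xbar)

for gamma in (0.0, 0.2, 0.4, 0.6):
    print(f"POW(T_alpha; mu1 = {gamma}) = {power(gamma):.3f}")
# at gamma = 0 the 'power' reduces to the type I error probability, .025
```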
2.2 How severity gives an inferential interpretation while scotching familiar fallacies
Suppose the observed p-value is .01. This report might be taken to reject the null hypothesis H0 and conclude H1. Why? An N-P behavioristic rationale might note that deciding to interpret the data this way would rarely be wrong. Were H0 true, so large a d(x0) would occur only 1% of the time.

In our inferential interpretation, the fact that the p-value is small (p(x0) = .01) supplies evidence for H1 because H1 has passed a severe test: with high probability (1 − p(x0)) such an impressive departure from H0 would not have occurred if H0 correctly described the data generating procedure. The severity definition is instantiated because:

(S-1): x0 accords with H1, and
(S-2): there is a high probability (.99) that a less statistically significant difference would have resulted, were H0 true.

This is entirely analogous to the way we reasoned informally about George's weight. Granted, evidence from any one test might at most be taken as some evidence that the effect is genuine. But after frequent rejections of H0, H1 passes a genuinely severe test because, were H1 false and the null hypothesis H0 true, we would very probably have obtained results that accord less well with H1 than the ones we actually observed.

So the p-value gives the kind of data-dependency that is missing from the coarse N-P tests, and it also lends itself to a severity construal — at least with respect to inferring the existence of some discrepancy from the null. We have an inferential interpretation, but there are still weaknesses we need to get around. A pair of criticisms relating to statistically significant results are associated with what we may call "fallacies of rejection".
2.3 Fallacies of rejection (errors in interpreting statistically significant results)
First there is the weakness that, at least on an oversimple construal of tests: (#2)
All statistically significant results are treated the same,
and second, that: (#3)
The p-value does not tell us how large a discrepancy is found.
We could avoid these criticisms if the construal of a statistically significant result were in terms of evidence for a particular discrepancy from H0 (an effect size), that is, for inferring:

H: µ > µ1 = (µ0 + γ) (there is evidence of a discrepancy γ).

The severity reasoning can be used to underwrite such inferences about particular discrepancies γ ≥ 0 from the null hypothesis, i.e., µ > (µ0 + γ). For each
result we need to show: (a) the discrepancies that are not warranted, and (b) those which are well warranted. The basis for doing so is summarized in (a) and (b):

(a) If there is a very high probability of obtaining so large a d(x0) (even) if µ ≤ µ1, then SEV(µ > µ1) is low.

By contrast:

(b) If there is a very low probability of obtaining so large a d(x0) if µ ≤ µ1, then SEV(µ > µ1) is high.

There are two key consequences. First, two different statistically significant results are distinguished by the inferences they severely warrant (criticism #2). Second, for any particular statistically significant result, the severity associated with µ > µ2 will differ from (be less than) that associated with µ > µ1, for any µ2 greater than µ1 (criticism #3).

Let us illustrate in detail with reference to our test Tα of hypotheses:

H0: µ = 0 vs. H1: µ > 0.

For simplicity, let it be known that σ = 2, and suppose n = 100, i.e., σx = .2. Let us call a result "statistically significant" if it is statistically significant at the .025 level, i.e., d(x0) > 1.96. To address criticism #2, consider three different significant results: d(x0) = 2.0 (x = 0.4), d(x0) = 3.0 (x = 0.6), d(x0) = 5.0 (x = 1.0).

Each statistically significant result "accords with" the alternative (µ > 0). So (S-1) is satisfied. Condition (S-2) requires the evaluation of the probability that test Tα would have produced a result that accords less well with H1 than x0 does (i.e., d(X) ≤ d(x0)), calculated under various discrepancies from 0. For illustration, imagine that we are interested in the inference µ > .2. The three different statistically significant outcomes result in different severity assignments for the same inference µ > .2.

Begin with d(x0) = 2.0 (x = 0.4). We have:

SEV(µ > .2) = P(X̄ < 0.4; µ > .2 is false) = P(X̄ < 0.4; µ ≤ .2 is true).

Remember, we are calculating the probability of the event {X̄ < .4}, and the claim to the right of the ";" should be read "calculated under the assumption that" one or another value of µ is correct. How do we calculate P(X̄ < .4; µ ≤ .2 is true) when µ ≤ .2 is a composite claim? We need only calculate it for the point µ = .2, because µ values less than .2 would yield an even higher SEV value. The severity for inferring µ > .2, when x = .4, is SEV(µ > .2) = .841. This follows from the fact that the observation x = .4 is one standard deviation (σx = .2) in excess of .2: the probability of the event (X̄ > .4) under the assumption that µ = .2 is .159, so the corresponding SEV is .841. By standardizing the difference, i.e., defining the standardized Normal random variable Z = (X̄ − µ)/σx ∽ N(0, 1), one can read off the needed probabilities from the standard Normal tables. Figures 1(a)-(d) show the probabilities beyond 1 and 2 standard deviations, as well as the .05 and .025 thresholds, i.e., 1.645 and 1.96, respectively.
[Figure 1 panels: Fig 1a. N(0, 1): right tail probability beyond 1 (one) standard deviation, P(Z > 1) = .159. Fig 1b. N(0, 1): right tail probability beyond 2 (two) standard deviations, P(Z > 2) = .023. Fig 1c. N(0, 1): 5% right tail probability, P(Z > 1.645) = .05. Fig 1d. N(0, 1): 2.5% right tail probability, P(Z > 1.96) = .025.]
Figure 1. Tail area probabilities of the standard normal (N (0, 1)) distribution
Now, let us consider the two other statistically significant outcomes, retaining this same inference of interest. When x = .6, we have SEV(µ > .2) = .977, since x = .6 is 2 standard deviations in excess of µ = .2. When x = 1, SEV(µ > .2) = .999, since x = 1 is 4 standard deviations in excess of µ = .2. So inferring the discrepancy µ > .2 is increasingly warranted for increasingly significant observed values. Hence, criticisms #2 and #3 are scotched by employing the severity evaluation.

If pressed, critics often concede that one can avoid the fallacies of rejection, but seem to argue that the tests are illegitimate because they are open to fallacious construals. This seems to us an absurd and illogical critique of the foundations of tests. We agree that tests should be accompanied by interpretive tools that avoid fallacies by highlighting the correct logic of tests. That is what the error statistical philosophy supplies. We do not envision computing these assessments each time, nor is this necessary. The idea would be to report severity values corresponding to the inferences of interest in the given problem; several benchmarks for well warranted and poorly warranted inferences would suffice.
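These assignments can be reproduced directly; the following sketch (ours; scipy assumed) recomputes the three severity values:

```python
# Sketch reproducing the severity assignments above for the inference mu > .2,
# with sigma = 2, n = 100, so sigma_xbar = .2.
from scipy.stats import norm

def sev_greater(mu1, xbar, sigma_xbar=0.2):
    # SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1) = P(Xbar <= xbar; mu = mu1)
    return norm.cdf((xbar - mu1) / sigma_xbar)

for xbar in (0.4, 0.6, 1.0):
    print(f"x = {xbar}: SEV(mu > 0.2) = {sev_greater(0.2, xbar):.5f}")
# -> 0.84134, 0.97725, 0.99997 (the .841, .977, .999 reported above)
```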
Figure 2. Significant result. Severity associated with inference µ > 0.2, with different outcomes x0.

Figure 2 shows three severity curves for test Tα, associated with different outcomes x0, where, as before, σ = 2, n = 100, d(x0) = (x − µ0)/σx, µ0 = 0:

for d(x0) = 2.0 (x = 0.4): SEV(µ > 0.2) = .841,
for d(x0) = 3.0 (x = 0.6): SEV(µ > 0.2) = .977,
for d(x0) = 5.0 (x = 1.0): SEV(µ > 0.2) = .999.

The vertical line at µ = .2 pinpoints the inference in our illustration, but sliding it along the µ axis one sees how the same can be done for different inferences, e.g., µ > .3, µ > .4, . . .

Criticism (#3) is often phrased as "statistical significance is not substantive significance". What counts as substantive significance is a matter of the context. What the severity construal of tests will do is tell us which discrepancies are and are not indicated, thereby avoiding the confusion between statistical and substantive significance.

To illustrate the notion of warranted discrepancies with data x0, consider figure 3, where we focus on just one particular statistically significant outcome, say x = 0.4, and consider different discrepancies γ from 0 one might wish to infer, each represented by a vertical line. To begin with, observe that SEV(µ > 0) = .977, i.e., 1 minus the p-value corresponding to this test. On the other hand, as the discrepancy increases from 0 to .2, SEV(µ > .2) is a bit lower, but not too bad: .841. We see that the SEV decreases as larger discrepancies from 0 are entertained (remembering the outcome is fixed at x = 0.4). An extremely useful benchmark is µ > .4, since that is the inference which receives severity .5. So we know immediately that SEV(µ > .5) is less than .5, and in particular it is .3. So
x=0.4 provides a very poor warrant for µ > .5. More than half the time such a significant outcome would occur even if µ ≤ .5.
Figure 3. Significant result. The severity for inferring different discrepancies µ > γ with the same outcome x = 0.4.

Many general relationships can be deduced. For example, since the assertions µ > µ1 and µ ≤ µ1 constitute a partition of the parameter space of µ, we have: SEV(µ > µ1) = 1 − SEV(µ ≤ µ1). As before, severity is evaluated at a point µ1, i.e., SEV(µ > µ1) = P(d(X) ≤ d(x0); µ = µ1).

Severity and Power with Significant Results: two key points

(i) It is important to note the relationship between our data-specific assessment of an α-level statistically significant result and the usual assessment of the power of test Tα at the alternative µ1 = (µ0 + γ). Power, remember, is always defined in terms of a rejection rule indicating the threshold (cα) beyond which the result is taken as statistically significant enough to reject the null; see section 2.1. If d(x0) is just statistically significant at the α-level, i.e., d(x0) = cα, the severity with which the test has passed µ > µ1 is: P(d(X) ≤ cα; µ = µ1) = 1 − POW(Tα; µ1). But the observed statistically significant d(x0) could exceed the mere cut-off value for significance cα.
Figure 4. Juxtaposing the power curve with the severity curve for x = .4.

Should we take a result that barely makes it to the cut-off just the same as one farther out into the rejection region? We think not, and the assessment of severity reflects this. As is plausible, the more significant result yields a higher severity for the same inference µ > µ1: P(d(X) ≤ d(x0); µ = µ1) exceeds P(d(X) ≤ cα; µ = µ1). That is, one minus the power of the test at µ1 provides a lower bound for the severity associated with the inference µ > µ1.

(ii) The higher the power of the test to detect discrepancy γ, the lower the severity associated with the inference µ > (µ0 + γ) when the test rejects H0. Hence, the severity with which alternative µ > (µ0 + γ) passes a test is not given by, and is in fact inversely related to, the test's power at µ1 = (µ0 + γ). This can be seen in figure 4, which juxtaposes the power curve with the severity curve for x = 0.4: the power curve slopes in the opposite direction from the severity curve.

As we just saw, the statistically significant result x = 0.4 is good evidence for µ > .2 (the severity was .841), but poor evidence for the discrepancy µ > .5 (the severity was .3). If the result does not severely pass the hypothesis µ > .5, it would be even less warranted to take it as evidence for a larger discrepancy, say µ > .8. The relevant severity evaluation yields P(d(X) ≤ 2.0; µ = .8) = .023, which is very low, but the power of the test at µ = .8 is very high, .977.

Putting numbers aside, an intuitive example makes the point clear. The smaller the mesh of a fishnet, the more capable it is of catching even small fish. So being given the report that (i) a fish is caught, and (ii) the net is highly capable of catching even 1-inch guppies, we would deny the report is good evidence of, say, a 9-inch fish! This takes us to our next concern.
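The inverse relation between power and severity in this example is easy to check numerically (our sketch; scipy assumed):

```python
# Sketch contrasting severity and power for x = .4 (d(x0) = 2.0)
# against the alternative mu1 = .8, with sigma_xbar = .2 and c_alpha = 1.96.
from scipy.stats import norm

sigma_xbar, xbar, mu1, c_alpha = 0.2, 0.4, 0.8, 1.96

sev = norm.cdf((xbar - mu1) / sigma_xbar)        # P(d(X) <= 2.0; mu = .8)
pow_ = 1 - norm.cdf(c_alpha - mu1 / sigma_xbar)  # P(d(X) > c_alpha; mu = .8)
print(f"SEV(mu > .8) = {sev:.3f}")        # 0.023: severely unwarranted
print(f"POW(T_alpha; .8) = {pow_:.3f}")   # ~0.979 (the text's .977 takes c_alpha ~ 2)
```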
2.4 Fallacies arising from overly sensitive tests
A common complaint concerning a statistically significant result is that for any discrepancy from the null, say γ ≥ 0, however small, one can find a large enough sample size n such that a test, with high probability, will yield a statistically significant result (for any p-value one wishes). (#4)
With large enough sample size even a trivially small discrepancy from the null can be detected.
A test can be so sensitive that a statistically significant difference from H0 only warrants inferring the presence of a relatively small discrepancy γ; a large enough sample size n will render the power POW(Tα; µ1 = µ0 + γ) very high. To make things worse, many assume, fallaciously, that reaching statistical significance at a given level α is more evidence against the null the larger the sample size (n). (Early reports of this fallacy among psychology researchers are in Rosenthal and Gaito, 1963.) Few fallacies more vividly show confusion about significance test reasoning. A correct understanding of testing logic would have nipped this fallacy in the bud 60 years ago. Utilizing the severity assessment one sees that an α-significant difference with n1 passes µ > µ1 less severely than with n2, where n1 > n2.

For a fixed type I error probability α, increasing the sample size decreases the type II error probability (power increases). Some argue that to balance the two error probabilities, the required α level for rejection should be decreased as n increases. Such rules of thumb are too tied to the idea that tests are to be specified and then put on automatic pilot without a reflective interpretation. The error statistical philosophy recommends moving away from all such recipes. The reflective interpretation that is needed drops out from the severity requirement: increasing the sample size does increase the test's sensitivity, and this shows up in the "effect size" γ that one is entitled to infer at an adequate severity level. To quickly see this, consider figure 5. It portrays the severity curves for test Tα (σ = 2), with the same outcome d(x0) = 1.96, but based on different sample sizes (n = 50, n = 100, n = 1000), indicating that the severity for inferring µ > .2 decreases as n increases:

for n = 50: SEV(µ > 0.2) = .895,
for n = 100: SEV(µ > 0.2) = .831,
for n = 1000: SEV(µ > 0.2) = .115.
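These values are straightforward to reproduce (our sketch; scipy and numpy assumed):

```python
# Sketch reproducing figure 5's values: severity of inferring mu > .2 from a
# just-significant result d(x0) = 1.96, at different sample sizes (sigma = 2).
import numpy as np
from scipy.stats import norm

def sev_at_n(n, mu1=0.2, d0=1.96, sigma=2.0, mu0=0.0):
    sigma_xbar = sigma / np.sqrt(n)
    xbar = mu0 + d0 * sigma_xbar     # the observed mean implied by d(x0) = 1.96
    return norm.cdf((xbar - mu1) / sigma_xbar)

for n in (50, 100, 1000):
    print(f"n = {n:5d}: SEV(mu > 0.2) = {sev_at_n(n):.3f}")
# -> .895, .831, .115: the same significance level licenses a smaller
#    warranted discrepancy as n grows
```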
The facts underlying criticism #4 are also erroneously taken as grounding the claim: “All nulls are false.” This confuses the true claim that with large enough sample size, a test has power to detect any discrepancy from the null however small, with the false claim that all nulls are false.
Figure 5. Severity associated with inference µ > 0.2, d(x0) = 1.96, and different sample sizes n.

The tendency to view tests as automatic recipes for rejection gives rise to another well-known canard: (#5)
Whether there is a statistically significant difference from the null depends on which is the null and which is the alternative.
The charge is made by considering the highly artificial case of two point hypotheses such as: µ = 0 vs. µ = .8. If the null is µ = 0 and the alternative is µ = .8, then x = 0.4 (being 2σx from 0) "rejects" the null and declares there is evidence for .8. On the other hand, if the null is µ = .8 and the alternative is µ = 0, then observing x = 0.4 now rejects .8 and finds evidence for 0. It appears that we get a different inference depending on how we label our hypotheses! Now the hypotheses in an N-P test must exhaust the space of parameter values, but even entertaining the two point hypotheses, the fallacy is easily exposed. Let us label the two cases:

Case 1: H0: µ = 0 vs. H1: µ = .8,
Case 2: H0 : µ=.8 vs. H1 : µ=0.
In case 1, x = 0.4 is indeed evidence of some discrepancy from 0 in the positive direction, but it is exceedingly poor evidence for a discrepancy as large as .8 (see figure 2). Even without the calculation that shows SEV(µ > .8) = .023, we know that SEV(µ > .4) is only .5, and so there are far weaker grounds for inferring an even larger discrepancy.5

5 We obtain the standardized value by considering the sample mean (x = .4) minus the hypothesized µ (.8), in standard deviation units (σx = .2), yielding z = −2, and thus P(Z < −2) = .023.
In case 2, the test is looking for discrepancies from the null (which is .8) in the negative direction. The outcome x = 0.4 (d(x0) = −2.0) is evidence that µ ≤ .8 (since SEV(µ ≤ .8) = .977), but there are terrible grounds for inferring the alternative µ = 0! In short, case 1 asks if the true µ exceeds 0, and x = .4 is good evidence of some such positive discrepancy (though poor evidence it is as large as .8); while case 2 asks if the true µ is less than .8, and again x = .4 is good evidence that it is. Both these claims are true. In neither case does the outcome provide evidence for the point alternative, .8 and 0 respectively. So it does not matter which is the null and which is the alternative, and criticism #5 is completely scotched.

Note further that in a proper test, the null and alternative hypotheses must exhaust the parameter space, and thus "point-against-point" hypotheses are at best highly artificial, at worst illegitimate. What matters for the current issue is that the error statistical tester never falls into the alleged inconsistency of inferences depending on which is the null and which is the alternative.

We now turn our attention to cases of statistically insignificant results. Overly high power is problematic in dealing with significant results, but with insignificant results the concern is that the test is not powerful enough.
2.5 Fallacies of acceptance: errors in interpreting statistically insignificant results

(#6)
Statistically insignificant results are taken as evidence that the null hypothesis is true.
We may call this the fallacy of interpreting insignificant results (or the fallacy of “acceptance”). The issue relates to a classic problem facing general hypothetical deductive accounts of confirmation: positive instances “confirm” or in some sense count for generalizations. Unlike logics of confirmation or hypothetico-deductive accounts, the significance test reasoning, and error statistical tests more generally, have a very clear basis for denying this. An observed accordance between data and a null hypothesis “passes” the null hypothesis, i.e., condition (S-1) is satisfied. But such a passing result is not automatically evidence for the null hypothesis, since the test might not have had much chance of detecting departures even if they existed. So what is called for to avoid the problem is precisely the second requirement for severity (S-2). This demands considering error probabilities, the distinguishing feature of an error statistical account. Now the simple Fisherian significance test, where the result is either to falsify the null or not, leaves failure to reject in some kind of limbo. That is why Neyman and Pearson introduce the alternative hypothesis and the corresponding notion of power. Consider our familiar test Tα . Affirming the null is to rule out a discrepancy γ > 0. It is unwarranted to claim to have evidence for the null if the test had little capacity (probability) of producing a worse fit with the null even though the null is false, i.e. µ > 0. In the same paper addressing Carnap, Neyman makes this
point (p. 41),6 although it must be conceded that it is absent from his expositions of tests. The severity account makes it an explicit part of interpreting tests (note that d(x0) = (x − µ0)/σx):

(a) If there is a very low probability that d(x0) would have been larger than it is, even if µ exceeds µ1, then µ ≤ µ1 passes the test with low severity, i.e., SEV(µ ≤ µ1) is low.

By contrast:

(b) If there is a very high probability that d(x0) would have been larger than it is, were µ to exceed µ1, then µ ≤ µ1 passes the test with high severity, i.e., SEV(µ ≤ µ1) is high.

To see how formal significance tests can encapsulate this, consider testing H0: µ = 0 vs. H1: µ > 0, and obtaining a statistically insignificant result: d(x0) ≤ 1.96. We have:

(S-1): x0 agrees with H0 since d(x0) ≤ 1.96.

We wish to determine if it is good evidence for µ ≤ µ1, where µ1 = µ0 + γ, by evaluating the probability that test Tα would have produced a more significant result (i.e., d(X) > d(x0)), if µ > µ1:

SEV(Tα, x0, µ ≤ µ1) = P(d(X) > d(x0); µ > µ1).

It suffices to evaluate this at µ1 = µ0 + γ because the probability increases for µ > µ1. So, if we have good evidence that µ ≤ µ1, we have even better evidence that µ ≤ µ2, where µ2 exceeds µ1 (since the former entails the latter).

Rather than work through calculations, it is revealing to report several appraisals graphically. Figure 6 shows severity curves for test Tα, where σ = 2, n = 100, based on three different insignificant results: d(x0) = 1.95 (x = .392), d(x0) = 1.5 (x = .3), d(x0) = .50 (x = .1). As before, let a statistically significant result require d(x0) > 1.96. None of the three insignificant outcomes provides strong evidence that the null is precisely true, but what we want to do is find the smallest discrepancy that each rules out with severity. For illustration, we consider a particular fixed inference of the form (µ ≤ µ1), and compare severity assessments for different outcomes.

6 In the context where H0 had not been "rejected", Neyman insists, it would be "dangerous" to regard this as confirmation of H0 if the test in fact had little chance of detecting an important discrepancy from H0, even if such a discrepancy were present. On the other hand if the test had appreciable power to detect the discrepancy, the situation would be "radically different." Severity logic for insignificant results has the same pattern except that we consider the actual insignificant result, rather than the case where data just misses the cut-off for rejection.
Figure 6. Insignificant result. Severity associated with inference µ ≤ .2 with different outcomes x0.

The low probabilities associated with the severity assessment of µ ≤ .2 indicate that, in all three cases, the claim that the discrepancy is µ ≤ .2 is unwarranted (to the degrees indicated):

for d(x0) = 1.95 (x = .39): SEV(µ ≤ 0.2) = .171,
for d(x0) = 1.5 (x = 0.3): SEV(µ ≤ 0.2) = .309,
for d(x0) = 0.5 (x = 0.1): SEV(µ ≤ 0.2) = .691.
So it would be fallacious (to different degrees) to regard these as warranting µ ≤ 0.2. To have a contrast, observe that inferring µ ≤ .6 is fairly warranted (to different degrees) for all three outcomes:

for d(x0) = 1.95 (x = .39): SEV(µ ≤ 0.6) = .853,
for d(x0) = 1.5 (x = 0.3): SEV(µ ≤ 0.6) = .933,
for d(x0) = 0.5 (x = 0.1): SEV(µ ≤ 0.6) = .995.
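A sketch (ours; scipy assumed) reproducing both sets of assessments:

```python
# Sketch reproducing the severity values for insignificant results above
# (sigma = 2, n = 100, sigma_xbar = .2):
# SEV(mu <= mu1) = P(d(X) > d(x0); mu = mu1).
from scipy.stats import norm

def sev_leq(mu1, xbar, sigma_xbar=0.2):
    return 1 - norm.cdf((xbar - mu1) / sigma_xbar)

for xbar in (0.39, 0.3, 0.1):
    print(f"x = {xbar}: SEV(mu <= .2) = {sev_leq(0.2, xbar):.3f}, "
          f"SEV(mu <= .6) = {sev_leq(0.6, xbar):.3f}")
# -> (.171, .853), (.309, .933), (.691, .994 ~ the text's .995)
```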
Working in the reverse direction, it is instructive to fix a high severity value, say .95, and ascertain, for different outcomes, the discrepancy that may be ruled out with severity .95: for x = 0.39, SEV(µ ≤ .72) = .95; for x = 0.3, SEV(µ ≤ .62) = .95; and for x = 0.1, SEV(µ ≤ .43) = .95. Although none of these outcomes warrants ruling out all positive discrepancies at severity level .95, we see that the smaller the observed outcome x, the smaller the µ1 value such that SEV(µ ≤ µ1) = .95. It is interesting to note that the severity curve associated with d(x0) = 1.95 virtually coincides with the power curve, since cα = 1.96 for α = .025. The power of
the test to detect µ1 gives the lower bound for the severity assessment for (µ ≤ µ1); this is the lowest it could be when an insignificant result occurs. High power at µ1 ensures that insignificance permits inferring with high severity that µ ≤ µ1. Thus severity gives an inferential justification for the predesignated power, but it goes further. Once the result is available, it directs us to give a more informative inference based on d(x0).

P-values are Not Posterior Probabilities of H0

The most well-known fallacy in interpreting significance tests is to equate the p-value with a posterior probability on the null hypothesis. In legal settings this is often called the prosecutor's fallacy. Clearly, however:

(i) P(d(X) ≥ d(x0); H0) is not equal to (ii) P(H0 | d(X) ≥ d(x0)).
The p-value assessment in (i) refers only to the sampling distribution of the test statistic d(X); there is no use of prior probabilities, as would be necessitated in (ii). In the frequentist context, {d(X) > 1.96} is an event and not a statistical hypothesis; the latter must assign probabilities to outcomes of an experiment of interest. Could we not regard such events as types of hypotheses or at least predictions? Sure. But scientific hypotheses of the sort statistical methods have been developed to test are not like that. Moreover, no prior probabilities are involved in (i): it is just the usual computation of probabilities of events "calculated under the assumption" of a given statistical hypothesis and model. (It is not even correct to regard this as a conditional probability.)

We are prepared to go further: it seems to us an odd way of talking to regard the null hypothesis as evidence for the event {d(X) ≤ 1.96}, or for its high probability. It is simply to state what is deductively entailed by the probability model and hypothesis. Most importantly, the statistical hypotheses we wish to make inferences about are not events; trying to construe them as such involves fallacies and inconsistencies (we return to this in Section 3). Some critics go so far as to argue that despite it being fallacious (to construe error probabilities as posterior probabilities of hypotheses): (#7)
Error probabilities are invariably misinterpreted as posterior probabilities.
Our discussion challenges this allegation that significance tests (and confidence intervals) are invariably used in “bad-faith”. We have put forward a rival theory as to the meaning and rationale for the use of these methods in science: properly interpreted, they serve to control and provide assessments of the severity with which hypotheses pass tests based on evidence. The quantitative aspects arise in the form of degrees of severity and sizes of discrepancies detected or not. This rival theory seems to offer a better explanation of inductive reasoning in science.
2.6 Relevance for finite samples
Some critics charge that because of their reliance on frequentist probability: (#8)
Error statistical tests are justified only in cases where there is a very long (if not infinite) series of repetitions of the same experiment.
Ironically, while virtually all statistical accounts appeal to good asymptotic properties in their justifications, the major asset of error statistical methods is in being able to give assurances that we will not be too far off with specifiable finite samples. This is a crucial basis both for planning tests and for critically evaluating inferences post-data.

Pre-data, the power has an important role to play in ensuring that test Tα has enough 'capacity', say (1−β), to detect a discrepancy of interest γ for µ1 = µ0 + γ. To ensure the needed power one often has no other option but to have a large enough sample size n. How large is 'large enough' is given by solving the probabilistic equation:

to ensure P(d(X) > cα; µ1) = (1−β), set n = {[(cα − cβ)σ]/γ}²,

where cβ is the threshold such that P(Z ≤ cβ) = β for Z ∽ N(0, 1).

Numerical example. Consider test Tα with µ0 = 0, α = .025, σ = 2, and let the substantive discrepancy of interest be γ = .4. Applying the above formula one can determine that the sample size needed to ensure high enough power, say (1−β) = .90, to detect such a discrepancy is:

n = {[(1.96 + 1.28)(2)]/(.4)}² ≈ 262,

i.e., the test needs about 262 observations to have .9 power to detect discrepancies γ ≥ .4. If the sample size needed for informative testing is not feasible, then there are grounds for questioning the value of the inquiry, but not for questioning the foundational principles of tests.

This points to a central advantage of the error statistical approach in avoiding the limitations of those accounts whose reliability guarantees stem merely from asymptotic results, that is, for n going to infinity. In particular, a test is consistent against some alternative µ1 when its power P(d(X) > cα; µ1) goes to one as n goes to infinity. This result, however, is of no help in assessing, much less ensuring, the reliability of the test in question for a given n. Considering discrepancies of interest restricts the latitude for test specification, not only in choosing sample sizes, but in selecting test statistics that permit error probabilities to be 'controlled' despite unknowns. We now turn to this.
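First, though, the numerical example above is easy to verify (our sketch; scipy assumed):

```python
# Sketch of the sample-size formula just given:
# n = ((c_alpha - c_beta) * sigma / gamma)^2, with P(Z <= c_beta) = beta.
from scipy.stats import norm

alpha, beta, sigma, gamma = 0.025, 0.10, 2.0, 0.4
c_alpha = norm.ppf(1 - alpha)   # 1.96
c_beta = norm.ppf(beta)         # -1.28
n = ((c_alpha - c_beta) * sigma / gamma) ** 2
print(f"n ~ {n:.1f}")  # ~262.7; the text's 262 rounds the quantiles to 1.96 and 1.28
```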
2.7 Dealing with "Nuisance" Parameters
In practice, a more realistic situation arises when, in the above simple Normal model (M), both parameters, µ and σ 2 , are unknown. Since the primary inference concerns µ and yet σ 2 is needed to complete the distribution, it is often called a “nuisance” parameter. Given the way nuisance parameters are handled in this
approach, changes to the testing procedure are rather minor, and the reasoning is unchanged. That is why we were able to keep to the simpler case for exposition. To illustrate, consider

EXAMPLE 2. Test Tα*, when σ² is unknown.
First, the test statistic now takes the form: d*(X) = √n(X̄ − µ0)/s, where s² = [1/(n−1)] Σ(Xk − X̄)² (sum over k = 1, ..., n) is the sample variance, which provides an unbiased estimator of σ². Second, the sampling distributions of d*(X), under both the null (H0) and the alternative (H1) hypotheses, are no longer Normal, but Student's t; see [Lehmann, 1986]. What matters is that the distribution of d*(X) under H0 does not involve the nuisance parameter σ²; the only difference is that one needs to use the Student's t instead of the Normal tables to evaluate cα corresponding to a particular α. The distribution of d*(X) under H1 does involve σ², but only through the non-centrality parameter affecting the power: δ = √n(µ1 − µ0)/σ. In practice one replaces σ with its estimate s when evaluating the power of the test at µ1 — likewise for severity. All the other elements of test Tα remain the same for Tα*, including the form of the test rule.

Ensuring error statistical calculations free of a nuisance parameter is essential for attaining objectivity: the resulting inferences are not threatened by unknowns. This important desideratum is typically overlooked in foundational discussions, and yet the error statistical way of satisfying it goes a long way toward answering the common charge that:
(#9)
Specifying statistical tests is too arbitrary.
In a wide class of problems, the error statistician attains freedom from a nuisance parameter by conditioning on a sufficient statistic for it (see [Cox and Hinkley, 1974]), leading to a uniquely appropriate test. This ingenious way of dealing with nuisance parameters stands in contrast with Bayesian accounts that require prior probability distributions for each unknown quantity. (Nuisance parameters also pose serious problems for pure likelihood accounts; see [Cox, 2006].) Once this is coupled with the requirement that the test statistics provide plausible measures of "agreement", the uniquely appropriate test is typically overdetermined: one can take one's pick for the rationale (appealing to Fisherian, Neyman-Pearsonian, or severity principles).
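For concreteness, here is a minimal sketch (ours, with hypothetical data; scipy and numpy assumed) of how Tα* replaces the Normal ingredients by Student's t ones:

```python
# Sketch of test T_alpha* with sigma^2 unknown: the Student's t statistic
# replaces sigma by the estimate s (data are hypothetical).
import numpy as np
from scipy.stats import t

def test_T_alpha_star(x, mu0=0.0, alpha=0.025):
    n = len(x)
    s = np.std(x, ddof=1)                    # square root of the unbiased sample variance
    d_star = np.sqrt(n) * (np.mean(x) - mu0) / s
    c_alpha = t.ppf(1 - alpha, df=n - 1)     # cut-off from Student's t, not the Normal
    p = 1 - t.cdf(d_star, df=n - 1)
    return d_star, c_alpha, p

rng = np.random.default_rng(1)
x = rng.normal(0.3, 2.0, size=100)           # hypothetical sample
d_star, c_alpha, p = test_T_alpha_star(x)
print(f"d*(x0) = {d_star:.2f}, c_alpha = {c_alpha:.2f}, p = {p:.3f}")
```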
2.8 Severe Testing and Confidence Interval (CI) Estimation
In CI estimation procedures, a statistic is used to set upper or lower (1-sided) or both (2-sided) bounds. For a parameter, say µ, a (1−α) CI estimation procedure leads to estimates of the form: µ = x ± e. Critics of significance tests often allege: (#10)
We should be doing confidence interval estimation rather than significance tests.
Although critics of significance tests often favor CIs, it is important to realize that CIs are still squarely within the error statistical paradigm. In fact there is a precise duality relationship between (1−α) CIs and significance tests: the CI contains the parameter values that would not be rejected by the given test at the specified level of significance [Neyman, 1935]. It follows that the (1−α) one-sided interval corresponding to test Tα is:

µ > X̄ − cα(σ/√n).
In particular, the 97.5% CI estimator corresponding to test Tα is:

µ > X̄ − 1.96(σ/√n).

Now it is true that the confidence interval gives a data-dependent result, but so does our post-data interpretation of tests based on severity. Moreover, confidence intervals have their own misinterpretations and shortcomings that we need to get around. A well-known fallacy is to construe (1−α) as the degree of probability to be assigned the particular interval estimate formed once X̄ is instantiated with x. Once the estimate is formed, either the parameter is or is not contained in it. One can say only that the particular estimate arose from a procedure which, with high probability, (1−α), would contain the true value of the parameter, whatever it is. This affords an analogous "behavioristic" rationale for confidence intervals as we saw with tests: different sample realizations x lead to different estimates, but one can ensure that (1−α)100% of the time the true parameter value µ, whatever it may be, will be included in the interval formed.

Just as we replace the behavioristic rationale of tests with the inferential one based on severity, we do the same with confidence intervals. The assertion µ > x − cα(σ/√n) is the one-sided (1−α) interval corresponding to the test Tα, and indeed, for the particular value µ1 = x − cα(σ/√n), the severity with which the inference µ > µ1 passes test Tα is (1−α). The severity rationale for applying the rule and inferring µ > x − cα(σ/√n) might go as follows: suppose this assertion is false, e.g., suppose µ1 = x − 1.96(σ/√n). Then the observed mean is 1.96 standard deviations in excess of µ1. Were µ1 the mean of the mechanism generating the observed mean, then with high probability (.975) a result less discordant with µ1 would have occurred. (For even smaller values of µ1 this probability is increased.)

However, our severity construal also demands breaking out of the limitations of confidence interval estimation. In particular, in the theory of confidence intervals, a single confidence level is prespecified and the one interval estimate corresponding to this level is formed as the inferential report. The resulting interval is sometimes used to perform statistical tests: hypothesized values of the parameter are accepted (or rejected) according to whether they are contained within (or outside) the resulting interval estimate. The same problems with automatic uses of tests with a single prespecified choice of significance level α reappear in
the corresponding confidence interval treatment. Notably, predesignating a single choice of confidence level, (1−α), is not enough. Here is why: a (1−α) CI corresponds to the set of null hypotheses that an observed outcome would not be able to reject with the corresponding α-level test. In our illustrative example of tests, the null value is fixed (we chose 0), and then the sample mean is observed. But we could start with the observed sample mean, and consider the values of µ that would not be rejected, were they (rather than 0) the null value. This would yield the corresponding CI. That is, the observed mean is not sufficiently greater than any of the values in the CI to reject them at the α-level. But as we saw in discussing severity for insignificant results, this does not imply that there is good evidence for each of the values in the interval: many values in the interval pass test Tα with very low severity with x0. Yet a report of the CI estimate is tantamount to treating each of the values of the parameter in the CI on a par, as it were. That some values are well, and others poorly, warranted is not expressed. By contrast, for each value of µ in the CI, there would be a different answer to the question: how severely does µ > µ1 pass with x0? The severity analysis, therefore, naturally leads to a sequence of inferences, or series of CIs, that are and are not warranted at different severity levels.7

7 A discussion of various attempts to consider a series of CIs at different levels, confidence curves [Birnbaum, 1961], p-value functions [Poole, 1987], consonance intervals [Kempthorne and Folks, 1971], and their relation to the severity evaluation is beyond the scope of this paper; see [Mayo and Cox, 2006].
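The point is easy to exhibit numerically: values inside a single reported interval can differ enormously in how severely they pass (our sketch; scipy assumed):

```python
# Sketch of the CI/severity duality for sigma = 2, n = 100, observed x = 0.4:
# the 97.5% lower bound passes with severity .975, but other values inside
# the interval are far from equally well warranted.
import numpy as np
from scipy.stats import norm

xbar, sigma, n = 0.4, 2.0, 100
sigma_xbar = sigma / np.sqrt(n)
lower = xbar - 1.96 * sigma_xbar
print(f"97.5% CI: mu > {lower:.3f}")            # mu > 0.008

for mu1 in (lower, 0.2, 0.3, 0.4):
    sev = norm.cdf((xbar - mu1) / sigma_xbar)   # SEV(mu > mu1) given x = 0.4
    print(f"SEV(mu > {mu1:.3f}) = {sev:.3f}")
# -> .975, .841, .691, .500
```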
3 ERROR STATISTICS VS. THE LIKELIHOOD PRINCIPLE
A cluster of debates surrounding error statistical methods, both in philosophy and in statistical practice, reflects contrasting answers to the question: what information is relevant for evidence and inference? Answers, in turn, depend on assumptions about the nature of inductive inference and the roles of probabilistic concepts in inductive inference. We now turn to this.

Consider a conception of evidence based just on likelihoods: data x0 are evidence for H so long as:

(a) P(x0; H) = high (or maximal),
(b) P(x0; H is false) = low.

Although at first blush these may look like the two conditions for severity ((S-1) and (S-2)), conditions (a) and (b) together are tantamount to a much weaker requirement: x0 is evidence for H so long as H is more likely on data x0 than is the denial of H — referring to the mathematical notion of likelihood. To see that this is scarcely sufficient for severity, consider a familiar example.
Maximally Likely Alternatives. H0 might be that a coin is fair, and x0 the result of n flips of the coin. For each of the 2^n possible outcomes there is a hypothesis Hi* that makes the data xi maximally likely. For an extreme case, Hi* can assert that the probability of heads is 1 just on those tosses that yield heads, 0 otherwise. For any xi, P(xi; H0) is very low and P(xi; Hi*) is high — one need only choose for (a) the statistical hypothesis that renders the data maximally likely, i.e., Hi*. So the fair coin hypothesis is always rejected in favor of Hi*, even when the coin is fair. This violates the severity requirement, since such a procedure is guaranteed to infer evidence of discrepancy from the null hypothesis even if the null is true: the severity of 'passing' Hi* is minimal or 0. (Analogous examples are the "point hypotheses" in [Cox and Hinkley, 1974, p. 51], Hacking's [1965] "tram car" example, and examples in [Mayo, 1996; 2008].)

This takes us to the key difference between the error statistical perspective and contrasting statistical philosophies; namely, that to evaluate and control error probabilities requires going beyond relative likelihoods.
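A tiny simulation (ours; numpy assumed) makes the example vivid: the rigged hypothesis always trumps the true null on likelihood alone:

```python
# Sketch of the maximally likely alternative: for any sequence of n coin flips,
# the rigged hypothesis H* (heads probability 1 exactly where heads occurred)
# has likelihood 1, while the fair coin H0 has likelihood (1/2)^n.
import numpy as np

rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=10)   # a sequence from a genuinely fair coin

lik_H0 = 0.5 ** len(flips)            # P(x; H0): fair coin
lik_Hstar = 1.0                       # P(x; H*): probability 1 on each observed outcome
print(f"P(x; H0) = {lik_H0:.6f}, P(x; H*) = {lik_Hstar}")
# H* is always 'better supported' by likelihood alone, even though H0 is true
```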
3.1 There is Often Confusion About Likelihoods
The distribution of the sample X assigns probability (or density) to each possible realization x, under some fixed value of the parameter θ, i.e., f(x; θ). In contrast, the likelihood assigns probability (or density) to a particular realization x, under different values of the unknown parameter θ. Since the data are fixed at x0 and the parameter varies, the likelihood is defined as proportional to f(x; θ) but viewed as a function of the parameter θ: L(θ; x0) ∝ f(x0; θ) for all θ ∈ Θ. Likelihoods do not obey the probability axioms; for example, the sum of the likelihoods of a hypothesis and its denial is not one.

Hacking [1965] is known for having championed an account of comparative support based on what he called the "law of likelihood": data x0 support hypothesis H1 less than H2 if the latter is more likely than the former, i.e., P(x0; H2) > P(x0; H1); when H2 is composite, one takes the maximum of the likelihood over the different values of θ admitted by H2. From a theory of support Hacking gets his theory of testing, whereby "an hypothesis should be rejected if and only if there is some rival hypothesis much better supported than it is. . . " [Hacking, 1965, p. 89]. Hacking [1980] distanced himself from this account because examples such as the one above illustrate that "there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did" [Barnard, 1972, p. 129].

Few philosophers or statisticians still advocate a pure likelihood account of evidence (exceptions might be [Rosenkrantz, 1977; Sober, 2008] among philosophers, and [Royall, 1997] among statisticians). However, many who would deny that relative likelihoods are all that is needed for inference still regard likelihoods as all that is needed to capture the import of the data. For example, a Bayesian may hold
that inference requires likelihoods plus prior probabilities, while still maintaining that the evidential import of the data is exhausted by the likelihoods. This is the gist of a general principle of evidence known as the Likelihood Principle (LP). Disagreement about the LP is a pivot point around which much of the philosophical debate between error statisticians and Bayesians has long turned. Holding the LP runs counter to distinguishing data on grounds of error probabilities of procedures.8

"According to Bayes's theorem, P(x|µ) ... constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ . . . ". [Savage, 1962, p. 17]

The italicized portion defines the LP. If a methodology allows data to enter only through the likelihoods, then clearly likelihoods contain all the import of the data — for that methodology. The philosophical question is whether relevant information is not thereby being overlooked. The holder of the LP considers the likelihood of just the actual outcome, i.e., just d(x0), whereas the error statistician needs to consider, in addition, the sampling distribution of d(X) or other statistic being used in inference. In other words, an error statistician could use likelihoods in arriving at (S-1), the condition of accordance or fit with the data, but (S-2) additionally requires considering the probability of outcomes x that accord less well with a hypothesis of interest H, were H false. In the error statistical account, drawing valid inferences from the data x0 that happened to be observed is crucially dependent on the relative frequency of outcomes other than the one observed, as given by the appropriate sampling distribution of the test statistic.

8 A weaker variation on the LP holds that likelihoods contain all the information within a given experiment, whereas the "strong" LP refers to distinct experiments. Here LP will always allude to the strong likelihood principle.
3.2 Paradox of Optional Stopping
The conflict we are after is often illustrated by a two-sided version of our test T. We have a random sample from a Normal distribution with mean µ and standard deviation 1, i.e., Xk ∼ N(µ, 1), k = 1, 2, ..., n, and wish to test the hypotheses: H0: µ = 0, vs. H1: µ ≠ 0. To ensure an overall significance level of .05, one rejects the null whenever |x̄| > 1.96/√n. However, instead of fixing the sample size in advance, we are to let n
be determined by a stopping rule: keep sampling until |x̄| > 1.96/√n. The probability that this rule will stop in a finite number of trials is 1, regardless of the true value of µ; it is a proper stopping rule. Whereas with n fixed in advance such a test has a type I error probability of .05, with this stopping rule the test would lead to an actual significance level that would differ from, and be greater than, .05. This is captured by saying that significance levels are sensitive to the stopping rule, and there is considerable literature as to how to adjust the error probabilities in the case of 'optional stopping', also called sequential tests [e.g., Armitage, 1975]. By contrast, since likelihoods are unaffected by this stopping rule, the proponent of the LP denies there really is an evidential difference between the cases where n is fixed and where n is determined by the stopping rule.9 To someone who holds a statistical methodology that satisfies the LP, it appears that: (#11)
Error statistical methods take into account the intentions of the scientists analyzing the data.
In particular, the inference depends on whether the scientist intended to stop at n or intended to keep going until a statistically significant difference from the null was found. The charge in #11 would seem to beg the question against the error statistical methodology, which has perfectly objective ways to pick up on the effect of stopping rules: far from depending on intentions "locked up in the scientist's head" (as critics allege), the manner of generating the data alters error probabilities, and hence severity assessments. As famously remarked in [Edwards et al., 1963]: "The likelihood principle emphasized in Bayesian statistics implies, . . . that the rules governing when data collection stops are irrelevant to data interpretation. This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" — in the sense of [Neyman and Pearson, 1933] (p. 239). While it may restore "simplicity and freedom", it does so at the cost of being unable to adequately control error probabilities [Berger and Wolpert, 1988; Cox and Hinkley, 1974; Kadane et al., 1999; Mayo and Kruse, 2001; Cox and Mayo, 2010].

9. Birnbaum [1962] argued that the LP follows from apparently plausible principles of conditionality and sufficiency. A considerable literature exists; see [Barnett, 1999]. Mayo [2010b] has recently argued that this "proof" is fallacious.
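A simple simulation conveys the error-statistical point. The sketch below (ours; the sample-size caps, seed, and function name are our own choices, not anything in the literature cited) estimates the actual rejection rate under H0 for the fixed-n test and for the try-and-try-again stopping rule:

```python
import random, math

def reject_by(n_max, mu=0.0, threshold=1.96):
    """Sample X_k ~ N(mu, 1) one at a time; stop (and 'reject H0') as soon
    as |xbar| > threshold/sqrt(n). Returns True if that happens by n_max."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(mu, 1.0)
        if abs(total / n) > threshold / math.sqrt(n):
            return True
    return False

random.seed(1)
trials = 2000

# n = 100 fixed in advance: the rejection rate under H0 is close to .05.
fixed = sum(abs(sum(random.gauss(0, 1) for _ in range(100)) / 100)
            > 1.96 / math.sqrt(100) for _ in range(trials)) / trials

# Optional stopping, allowed up to 1000 observations: the actual
# significance level is far greater than the nominal .05.
hunting = sum(reject_by(1000) for _ in range(trials)) / trials

print(fixed, hunting)   # roughly .05 vs. well above the nominal .05
```

Nothing about the scientist's state of mind enters: the inflated error rate is a property of the data-generating procedure itself.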
3.3 The Reference Bayesians and the Renunciation of the LP
All error probabilistic notions are based on the sampling distribution of a statistic, and thus for an error statistician reasoning from data x0 always depends on considering how a procedure would handle outcomes other than x0; it is necessary to consider how often the result would occur in hypothetical repetitions. This conflicts with the likelihood principle (LP). Therefore, objectivity for the error
statistician entails violating the LP — long held as the core of Bayesianism [Mayo, 1983; 1985; 1996]. In fact, Bayesians have long argued for foundational superiority over frequentist error statisticians on grounds that they uphold, while frequentists violate, the likelihood principle (LP), leading the latter into Bayesian incoherency. Frequentists have long responded that having a good chance of getting close to the truth and avoiding error is what matters [Cox and Hinkley, 1974]. However, many Bayesian statisticians these days seem to favor the use of conventionally chosen or "reference" Bayesian priors, both because of the difficulty of eliciting subjective priors, and the reluctance of many scientists to allow subjective beliefs to overshadow the information provided by data. These reference priors, however, violate the LP. Over the past few years, leading developers of reference Bayesian methods [Bernardo, 2005; Berger, 2004] concede that desirable reference priors force them to consider the statistical model, leading to violations of basic principles, such as the likelihood principle and the stopping rule principle; see [Berger and Wolpert, 1988]. Remarkably, they are now ready to admit that "violation of principles such as the likelihood principle is the price that has to be paid for objectivity" [Berger, 2004]. Now that the reference Bayesian concedes that violating the LP is necessary for objectivity, there may seem to be an odd sort of agreement between the reference Bayesian and the error statistician. Do the concessions of reference Bayesians bring them closer to the error statistical philosophy? To even consider this possibility one would need to deal with a crucial point of conflict as to the basic role of probability in induction. Although Bayesians disagree among themselves about both the interpretation of posterior probabilities and their numerical values, they concur that "what you 'really' want are posterior probabilities for different hypotheses." It is well known that error probabilities differ from posteriors. In a variation on the charge of misinterpretation in (#6), critics seek examples where "p-values conflict with Bayesian posteriors," leading to results apparently counterintuitive even from the frequentist perspective. We consider the classic example from statistics.

(Two-sided) Test of a Mean of a Normal Distribution

The conflict between p-values and Bayesian posteriors often considers the familiar example of the two-sided T2α test for the hypotheses: H0: µ = 0, vs. H1: µ ≠ 0. The difference between p-values and posteriors is far less marked with one-sided tests, e.g., [Pratt, 1977; Casella and Berger, 1987]. Critics observe:
"If n = 50 one can classically 'reject H0 at significance level p = .05,' although P(H0|x) = .52 (which would actually indicate that the evidence favors H0)." ([Berger and Sellke, 1987, p. 113]; we replace Pr with P for consistency.)

Starting with a high enough prior probability on the point null (or, more correctly, on a small region around it), they show that an α-significant difference can correspond to a posterior probability in H0 that is not small. Where Bayesians take this as problematic for significance testers, the significance testers balk at the fact that use of the recommended priors can result in highly significant results being construed as no evidence against the null — or even "that the evidence favors H0." If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82! [Berger and Sellke, 1987]. Here, statistically significant results — results that we would regard as passing the non-null hypothesis severely — correspond to an increase in probability from the prior (.5) to the posterior. What justifies this prior? The Bayesian prior probability assignment of .5 to H0, with the remaining .5 probability being spread out over the alternative parameter space (e.g., as recommended by Jeffreys [1939]), is claimed to offer an "objective" assessment of priors: the priors are to be read off from a catalogue of favored "reference" priors; no subjective beliefs are to enter. It is not clear how this negative notion of objectivity secures the assurance we would want of being somewhere close to the truth. The Bayesians do not want too small a prior for the null, since then evidence against the null is merely to announce that an improbable hypothesis has become more improbable. Yet the spiked concentration of belief ("weight") in the null is at odds with the prevailing use of null hypotheses as simply a standard from which one seeks discrepancies. Finally, these examples where p-values differ from posteriors create a tension between the posterior probability in a testing context and the corresponding (highest probability) Bayesian confidence interval: the low posterior indicates scarce evidence against the null even though the null value is outside the corresponding Bayesian confidence interval [Mayo, 2005]. Some examples strive to keep within the frequentist camp: to construe a hypothesis as a random variable, it is imagined that we sample randomly from a population of hypotheses, some proportion of which are assumed to be true. The percentage "initially true" serves as the prior probability for H0. This gambit commits what for a frequentist would be a fallacious instantiation of probabilities:

50% of the null hypotheses in a given pool of nulls are true.
This particular null hypothesis H0 was randomly selected from this pool.
Therefore P(H0 is true) = .5.

Even allowing that the probability of a randomly selected hypothesis taken from an "urn" of hypotheses, 50% of which are true, is .5, it does not follow that
this particular hypothesis, the one we happened to select, has a probability of .5, however probability is construed [Mayo, 1997; 2005; 2010b].10 Besides, it is far from clear which urn of null hypotheses we are to sample from. The answer will greatly alter whether or not there is evidence. Finally, it is unlikely that we would ever know the proportion of true nulls, rather than merely the proportion that have thus far not been rejected by other statistical tests! Whether the priors come from frequencies or from "objective" Bayesian priors, there are claims that we would want to say had passed severely that do not get a high posterior. These brief remarks put the spotlight on the foundations of current-day reference Bayesians — arguably the predominant form of Bayesianism advocated for science. They are sometimes put forward as a kind of half-way house offering a "reconciliation" between Bayesian and frequentist accounts. Granted, there are cases where it is possible to identify priors that result in posteriors that "match" error probabilities, but they appear to mean different things. Impersonal or reference priors are not to be seen as measuring beliefs or even probabilities — they are often improper.11 Subjective Bayesians often question whether the reference Bayesian is not here giving up on the central Bayesian tenets (e.g., [Dawid, 1997; Lindley, 1997]).

10. The parallel issue is raised by Bayesian epistemologists; see [Achinstein, 2010; Mayo, 2005; 2010c; 2010d].

11. Interestingly, some frequentist error statisticians are prepared to allow that reference Bayesian techniques might be regarded as technical devices for arriving at procedures that may be reinterpreted and used by error statisticians, but for different ends (see [Cox, 2006; Cox and Mayo, 2009; Kass and Wasserman, 1996]).
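The posterior figures quoted above can be reproduced under the spiked-prior setup just described: a .5 mass on the point null, with the remaining .5 spread as a standard Normal over the alternative. The following sketch assumes exactly that prior; the function name and the way the computation is packaged are ours:

```python
from math import sqrt, pi, exp

def posterior_null(n, z=1.96, pi0=0.5):
    """P(H0 | xbar) for H0: mu = 0, with prior mass pi0 on the point null
    and the remaining mass spread as N(0, 1) over the alternative,
    when xbar is just significant at the .05 level, i.e. xbar = z/sqrt(n)."""
    xbar = z / sqrt(n)
    # Marginal likelihood of xbar under H0: the N(0, 1/n) density at xbar.
    m0 = sqrt(n / (2 * pi)) * exp(-n * xbar**2 / 2)
    # Marginal under H1: normal-normal convolution gives N(0, 1 + 1/n).
    tau2 = 1 + 1 / n
    m1 = exp(-xbar**2 / (2 * tau2)) / sqrt(2 * pi * tau2)
    return pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)

for n in (50, 1000):
    print(n, round(posterior_null(n), 2))   # .52 and .82, as quoted above
```

The same z = 1.96, 'just significant' result thus yields an ever higher posterior on the null as n grows, which is the nub of the conflict.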
4 ERROR STATISTICS IS SELF-CORRECTING: TESTING STATISTICAL MODEL ASSUMPTIONS
The severity assessment of the primary statistical inference depends on the assumptions of the statistical model M being approximately true. Indeed, all model-based statistical methods depend, for their validity, on satisfying the model assumptions, at least approximately; a crucial part of the objectivity of error statistical methods is their ability to be used for this self-correcting goal. Some critics would dismiss the whole endeavor of checking model assumptions on the grounds that: #12
All models are false anyway.
This charge overlooks the key function in using statistical models, as argued by Cox [1995, p. 456]: "... it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. ... The construction of idealized representations that capture important stable
aspects of such systems is, however, a vital part of general scientific analysis." In order to obtain reliable knowledge of "important stable aspects" of phenomena, tests framed within approximately correct models will do — so long as their relevant error probabilities are close to those calculated. Statistical misspecifications often create sizeable deviations between the calculated or nominal error probabilities and the actual error probabilities associated with an inference, thereby vitiating error statistical inferences. Since even Bayesian results depend on the approximate validity of their statistical models, this might be an area for the Bayesian to employ non-Bayesian methods. The error statistician pursues the testing of assumptions using three different types of tools: informal analyses of data plots, non-parametric and parametric tests, and simulation-based methods, including resampling. Philosophers of science tend to speak of "the data" in a way that does not distinguish the different ways in which a given set of data are modeled, and yet such distinctions are crucial for understanding how reliable tests of assumptions are obtained. In using data to test model assumptions one looks, not at the reduced data in the test statistic (for primary testing), but rather at the full data set x0 := (x1, x2, . . . , xn). For example, in test Tα, X̄ = (1/n) Σ_{k=1}^{n} Xk is a sufficient statistic for parameter µ. That means X̄, together with its sampling distribution, contains all the information needed for those inferences. However, X̄, by itself, does not provide sufficient information to assess the validity of the model assumptions underlying test Tα above. There are actually four distinct assumptions (table 1).

Table 1 — Simple Normal Model: Xk = µ + uk, k ∈ N,
[1] Normality: Xk ∼ N(·, ·),
[2] constant mean: E(Xk) := µ, k ∈ N,
[3] constant variance: Var(Xk) := σ²,
[4] Independence: {Xk, k ∈ N} is an independent process.

The inferences about µ depend on the assumptions, but the tests of those assumptions should not depend on the unknowns. The idea underlying model validation is to construct Mis-Specification (M-S) tests using 'distance' functions whose distribution under the null (the model is valid) is known, and which at the same time have power against potential departures from the model assumptions. M-S tests can be regarded as posing 'secondary' questions to the data, as opposed to the primary ones. Whereas primary statistical inferences take place within a specified (or assumed) model M, the secondary inference has to put M's assumptions to the test; so to test M's assumptions, we stand outside M, as it were. The generic form of the hypothesis of interest in M-S tests is: H0: the assumptions of statistical model M hold for data x0,
as against the alternative not-H0, which, in general, consists of all of the ways one or more of M's assumptions can founder. However, this alternative [P − M], where P denotes the set of all possible models that could have given rise to data x0, is too unwieldy. In practice one needs to consider a specific form of departure from H0, say Hr, in order to apply a statistical significance test to H0, and test results must be interpreted accordingly. Since with this restricted alternative Hr the null and alternative do not exhaust the possibilities (unlike in the N-P test), a statistically significant departure from the null would not warrant inferring the particular alternative Hr in an M-S test, at least not without further testing; see [Spanos, 2000]. In M-S testing, the logic of significance tests is this: we identify a test statistic τ(X) to measure the distance between what is observed, x0, and what is expected assuming the null hypothesis H0 holds, so as to derive the distribution of τ(X) under H0. The relevant p-value would then be: P(τ(X) > τ(x0); H0 true) = p, and if it is very small, there is evidence of violations of the assumption(s) in H0. We leave to one side here the particular levels counting as 'small'. A central asset of the error statistical approach to model validation is its ability to compute the p-value, and other relevant error probabilities, now dealing with erroneous inferences regarding the assumptions. Although the alternative may not be explicit in this simple (Fisherian) test, the interest in determining what violations have been ruled out with severity leads one to make them explicit. This may be done by considering the particular violations from H0 that the given test is capable of probing. This goes beyond what, strictly speaking, is found in standard M-S tests; so once again the severity requirement is directing supplements. The upshot for interpreting M-S test results is this: if the p-value is not small, we are entitled only to rule out those departures that the test had enough capacity to detect. In practice, the alternatives may be left vague or made specific. We consider an example of each, the former with a non-parametric test, the latter with a parametric test.
4.1 Runs Test for IID
An example of a non-parametric M-S test for IID (assumptions [2]–[4]) is the well-known runs test. The basic idea is that if the sample X := (X1, X2, . . . , Xn) is random (IID), then one can compare the number of runs expected, E(R), in a typical realization of an IID sample with the number of runs observed, R = r, giving rise to a test statistic:

τ(X) = [R − E(R)] / √Var(R),

whose distribution under IID for n ≥ 20 can be approximated by N(0, 1). The number of runs R is evaluated in terms of the residuals: ûk = (Xk − X̄), k = 1, 2, ..., n,
where instead of the particular value of each observed residual one records its sign, positive, a "+", or negative, a "−", giving rise to patterns of the form:

++ | − | ++ | − | +++ | − | + | − | + | −− | +++++ | − | · · ·

(the runs here being numbered 1, 2, ..., 12).
The patterns we are interested in are called runs: a sub-sequence of one type (pluses only or minuses only) immediately preceded and succeeded by an element of the other type. The appeal of such a non-parametric test is that its own validity does not depend on the assumptions of the (primary) model under scrutiny: we can calculate the probability of different numbers of runs just from the hypothesis that the assumption of randomness holds. As is plausible, then, the test based on R takes the form: reject H0 iff the observed R differs sufficiently (in either direction) from E(R) — the expected R under the assumption of IID. The p-value is: P(τ(X) > τ(x0); IID true) = p. However, since the test is sensitive to any form of departure from the IID assumptions, rejecting the null only warrants inferring a denial of IID. The test itself does not indicate whether the fault lies with one or the other or both assumptions. Combining this test with other misspecification analyses, however, can; see [Mayo and Spanos, 2004].
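A minimal implementation may help fix ideas. The sketch below uses the standard Wald–Wolfowitz moments for E(R) and Var(R), conditional on the numbers of '+' and '−' signs; the simulated data, seed, and function name are our own illustration:

```python
from math import sqrt, erfc
import random

def runs_test(x):
    """Runs test for IID based on the signs of the residuals (x_k - xbar).
    Returns (tau, two-sided p-value) via the N(0,1) approximation (n >= 20)."""
    n = len(x)
    xbar = sum(x) / n
    signs = ['+' if xk - xbar >= 0 else '-' for xk in x]
    r = 1 + sum(signs[k] != signs[k - 1] for k in range(1, n))  # observed runs
    n_pos = signs.count('+')
    n_neg = n - n_pos
    # Moments of R under randomness, conditional on n_pos and n_neg:
    e_r = 1 + 2 * n_pos * n_neg / n
    var_r = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n**2 * (n - 1))
    tau = (r - e_r) / sqrt(var_r)
    return tau, erfc(abs(tau) / sqrt(2))   # two-sided normal tail probability

random.seed(0)
iid = [random.gauss(0, 1) for _ in range(100)]
trend = [0.05 * k + random.gauss(0, 1) for k in range(100)]  # mean not constant
print(runs_test(iid))     # tau near 0: no sign of departure detected
print(runs_test(trend))   # large negative tau: too few runs, IID rejected
```

As the text notes, the rejection in the second case warrants only the denial of IID: the statistic itself does not say whether the fault lies with [2], [3], or [4].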
4.2 A Parametric Test of Independence
Let us now compare this to a parametric M-S test. We begin by finding a way to formally express the denial of the assumption in question by means of a parameter value in an encompassing model. In particular, the dependence among the Xk's may be formally expressed as an assertion that the correlation between any Xi and Xj for i ≠ j is non-zero, which in turn may be parameterized by the following AutoRegressive (AR(1)) model:

Xk = β0 + β1 Xk−1 + εk, k = 1, 2, ..., n.

In the context of this encompassing model the independence assumption in [4] can be tested using the parametric hypotheses: H0: β1 = 0, vs. H1: β1 ≠ 0. Notice that under H0 the AR(1) model reduces to Xk = µ + uk; see table 1. Rejection of the null based on a small enough p-value provides evidence for a violation of independence. Failing to reject entitles us to claim that the departures the test was capable of detecting are not present; see [Spanos, 1999] for further discussion.
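The following sketch carries out this test by ordinary least squares and reports the t-ratio for β1 = 0; the simulated data and function name are our own illustration, not part of the text's example:

```python
import random
from math import sqrt

def ar1_independence_test(x):
    """OLS fit of X_k = b0 + b1*X_{k-1} + e_k; returns the t-ratio for b1 = 0.
    A large |t| is evidence against the independence assumption [4]."""
    y = x[1:]        # X_k
    z = x[:-1]       # X_{k-1}
    n = len(y)
    zbar, ybar = sum(z) / n, sum(y) / n
    szz = sum((zi - zbar)**2 for zi in z)
    b1 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / szz
    b0 = ybar - b1 * zbar
    resid = [yi - b0 - b1 * zi for zi, yi in zip(z, y)]
    s2 = sum(e**2 for e in resid) / (n - 2)   # residual variance estimate
    return b1 / sqrt(s2 / szz)                # compare to t (or N(0,1)) quantiles

random.seed(0)
iid = [random.gauss(0, 1) for _ in range(200)]
dep = [0.0]
for _ in range(199):                          # genuinely autocorrelated data
    dep.append(0.6 * dep[-1] + random.gauss(0, 1))

print(ar1_independence_test(iid))   # small |t|: no detected dependence
print(ar1_independence_test(dep))   # |t| well above 2: independence rejected
```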
4.3 Testing Model Assumptions Severely
In practice one wants to perform a variety of M-S tests assessing different subsets of assumptions [1]–[4], using tests which themselves rely on dissimilar assumptions. The secret lies in shrewd testing strategies: following a logical order of parametric and non-parametric tests, and combining tests that jointly probe several violations with deliberately varied assumptions. This enables one to argue, with severity, that when no departures from the model assumptions are detected, despite all of these distinct probes, the model is adequate for the primary inferences. The argument is entirely analogous to the argument from coincidence that let us rule out values of George's weight gain earlier on. To render the probing more effective, error statisticians employ data analytic techniques and data plots to get hints of the presence of potential violations, indicating the most fruitful analytic tests to try. In relation to this some critics charge: #13
Testing assumptions involves illicit data-mining.
The truth of the matter is that data plots provide the best source of information pertaining to potential violations, information that can be used to guide a more informative and effective probing. Far from being illicit data mining, such graphical analysis is a powerful way to get ideas concerning the type of M-S tests to apply in order to check assumptions most severely. It provides an effective way to probe what is responsible for the observed pattern, much as a forensic clue is used to pinpoint the culprit; see [Spanos, 2000]. The same logic is at the heart of non-parametric tests of assumptions, such as the runs test.
4.4 Residuals provide the Key to M-S testing
A key difference between testing the primary hypotheses of interest and M-S testing is that they pose very different questions to the data, in a way that renders the tests largely independent of each other. This can be justified on formal grounds using the properties of sufficiency and ancillarity; see [Cox and Hinkley, 1974]. It can be shown [Spanos, 2007] that, in many cases, including the above example of the simple Normal model, the information used for M-S testing purposes is independent of the information used in drawing primary inferences. In particular, the distribution of the sample for the statistical model in question simplifies as follows:

(3)  f(y; θ) ∝ f(s; θ) · f(r),  ∀(s, r) ∈ R_s^m × R_r^{n−m},
where the statistics R and S are not only independent, but S is a sufficient statistic for θ := (µ, σ²) (the unknown parameters of the statistical model) and R is ancillary for θ, i.e., f(r) does not depend on θ. Due to these properties, the primary inference about θ can be based solely on the distribution of the sufficient
statistic, f(s; θ), and f(r) can be used to assess the validity of the statistical model in question. In the case of the simple Normal model (table 1), the statistics S and R take the form:

S := (X̄, s), where X̄ = (1/n) Σ_{k=1}^{n} Xk and s² = (1/(n−1)) Σ_{k=1}^{n} (Xk − X̄)²,

R := (v̂3, ..., v̂n), with v̂k = √n ûk / s = √n (Xk − X̄) / s ∼ St(n−1), k = 3, 4, ..., n,
where the v̂k are known as studentized residuals; see [Spanos, 2007]. Note that the runs test, discussed above, relies on residuals because it is based on replacing their numerical values with their signs (+ or −). Likewise, the parametric M-S test for independence, placed in the context of the AR(1) model, can be shown to be equivalently based on the auxiliary autoregression in terms of the residuals:

ûk = β0 + β1 ûk−1 + εk, k = 1, 2, ..., n.
The above use of the residuals for model validation is in the spirit of the strategy in [Cox and Hinkley, 1974] to use the conditional distribution f(x | s) to assess the adequacy of the statistical model. What makes f(x | s) appropriate for assessing model adequacy is that when s is a sufficient statistic for θ, f(x | s) is free of the unknown parameter(s). The simple Poisson and Bernoulli models provide such examples [Cox, 2006, p. 33].
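In code the split is a few lines (a sketch; the function name and data are ours). The sufficient statistic goes to the primary inference, the studentized residuals to assumption-checking:

```python
from math import sqrt
import random

def sufficiency_split(x):
    """Split the data into the sufficient statistic S = (xbar, s), used for
    the primary inference, and studentized residuals, used for M-S testing."""
    n = len(x)
    xbar = sum(x) / n
    s = sqrt(sum((xk - xbar)**2 for xk in x) / (n - 1))
    v = [sqrt(n) * (xk - xbar) / s for xk in x]   # studentized residuals
    return (xbar, s), v

random.seed(0)
data = [random.gauss(5, 2) for _ in range(50)]
(xbar, s), v = sufficiency_split(data)
# (xbar, s) feeds the inference about mu; the residuals v carry the
# parameter-free information used to check assumptions [1]-[4]
# (e.g., their signs feed the runs test of Section 4.1).
print((round(xbar, 2), round(s, 2)), [round(vk, 2) for vk in v[:3]])
```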
4.5 Further Topics, Same Logic
If one finds violations of the model assumptions then the model may need to be respecified to capture the information not accounted for, but the general discussion of respecification is beyond the scope of this entry; see [Spanos, 2006]. A distinct program of research for the error statistician is to explore the extent to which violations invalidate tests. Thanks to robustness, certain violations of the model assumptions will not ruin the validity of the test concerning the primary hypothesis. Of particular interest in error statistics is a set of computer-intensive techniques known as resampling procedures, including permutation methods, the bootstrap and Monte Carlo simulations, which are based on empirical relative frequencies. Even without knowing the sampling distribution, one can, in effect generate it by means of these techniques. The logic underlying the generation of these simulated realizations is based on counterfactual reasoning: We ask, ‘what would it be like (in terms of sampling distributions of interest) were we to sample from one or another assumed generating mechanism?’ The results can then be used to empirically construct (by “brute force” some claim) the sampling distributions of any statistic of interest and their corresponding error probabilities. This is particularly useful in cases where the sampling distribution of an estimator or a test statistic cannot be derived analytically, and these resampling methods
can be used to evaluate it empirically; see [Efron and Tibshirani, 1993]. The same pattern of counterfactual reasoning around which severity always turns is involved, thus unifying the methods under the error statistical umbrella. Here, however, the original data are compared with simulated replicas generated under a number of different data generating mechanisms, in order to discern the discrepancy between what was observed and “what it would be like” under various scenarios. It should be noted that model specification is distinct from model selection, which amounts to choosing a particular model within a prespecified family of models; see [Spanos, 2010].
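As an illustration of generating a sampling distribution "by brute force", here is a minimal bootstrap sketch; the data, seed, and function names are illustrative, and the percentile interval at the end is one standard use of the simulated distribution:

```python
import random

def bootstrap_dist(x, stat, reps=5000):
    """Empirically generate the sampling distribution of stat(X) by
    resampling with replacement from the observed data x."""
    n = len(x)
    return sorted(stat([random.choice(x) for _ in range(n)])
                  for _ in range(reps))

random.seed(0)
x0 = [random.gauss(10, 3) for _ in range(40)]          # the observed data
boot = bootstrap_dist(x0, lambda s: sum(s) / len(s))   # statistic: the mean

# Error probabilities are read off the simulated distribution; e.g., an
# approximate 95% confidence interval for the mean (percentile method):
print(boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])
```

The counterfactual question is answered empirically: this is what the statistic would look like were we to sample repeatedly from the assumed generating mechanism.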
BIBLIOGRAPHY

[Achinstein, 2010] P. Achinstein. Mill's Sins or Mayo's Errors? pp. 170-188 in Error and Inference, D. G. Mayo and A. Spanos (eds.), Cambridge University Press, Cambridge, 2010.
[Armitage, 1975] P. Armitage. Sequential Medical Trials, 2nd ed., Wiley, NY, 1975.
[Barnard, 1972] G. A. Barnard. The Logic of Statistical Inference (review of Ian Hacking, The Logic of Statistical Inference), British Journal for the Philosophy of Science, 23: 123-132, 1972.
[Barnett, 1999] V. Barnett. Comparative Statistical Inference, 3rd ed., Wiley, NY, 1999.
[Bernardo, 2005] J. M. Bernardo. Reference Analysis, pp. 17-90 in Handbook of Statistics, vol. 25: Bayesian Thinking, Modeling and Computation, D. K. Dey and C. R. Rao (eds.), Elsevier, North-Holland, 2005.
[Berger, 2004] J. Berger. The Case for Objective Bayesian Analysis, Bayesian Analysis, 1: 1-17, 2004.
[Berger and Sellke, 1987] J. Berger and T. Sellke. Testing a point-null hypothesis: the irreconcilability of significance levels and evidence, Journal of the American Statistical Association, 82: 112-122, 1987.
[Berger and Wolpert, 1988] J. Berger and R. Wolpert. The Likelihood Principle, 2nd ed., Institute of Mathematical Statistics, Hayward, CA, 1988.
[Birnbaum, 1961] A. Birnbaum. Confidence Curves: An Omnibus Technique for Estimation and Testing, Journal of the American Statistical Association, 294: 246-249, 1961.
[Birnbaum, 1962] A. Birnbaum. On the Foundations of Statistical Inference (with discussion), Journal of the American Statistical Association, 57: 269-326, 1962.
[Birnbaum, 1969] A. Birnbaum. Concepts of Statistical Evidence, pp. 112-143 in S. Morgenbesser, P. Suppes, and M. White (eds.), Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, St. Martin's Press, NY, 1969.
[Casella and Berger, 1987] G. Casella and R. Berger. Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem, Journal of the American Statistical Association, 82: 106-111, 1987.
[Cheyne and Worrall, 2006] C. Cheyne and J. Worrall, eds. Rationality and Reality: Conversations with Alan Musgrave, Studies in the History and Philosophy of Science, Springer, Dordrecht, 2006.
[Cox, 1958] D. R. Cox. Some Problems Connected with Statistical Inference, Annals of Mathematical Statistics, 29: 357-372, 1958.
[Cox, 1977] D. R. Cox. The Role of Significance Tests (with discussion), Scandinavian Journal of Statistics, 4: 49-70, 1977.
[Cox, 1995] D. R. Cox. Comment on "Model Uncertainty, Data Mining and Statistical Inference," by C. Chatfield, Journal of the Royal Statistical Society, A 158: 419-466, 1995.
[Cox, 2006] D. R. Cox. Principles of Statistical Inference, Cambridge University Press, Cambridge, 2006.
[Cox and Hinkley, 1974] D. R. Cox and D. V. Hinkley. Theoretical Statistics, Chapman and Hall, London, 1974.
[Cox and Mayo, 2010] D. R. Cox and D. G. Mayo. Objectivity and Conditionality in Frequentist Inference, pp. 276-304 in Mayo, D. G. and A. Spanos, Error and Inference, Cambridge University Press, Cambridge, 2010.
[Dawid, 1997] A. P. Dawid. Comments on "Non-informative priors do not exist", Journal of Statistical Planning and Inference, 65: 159-162, 1997.
[Edwards et al., 1963] W. Edwards, H. Lindman, and L. Savage. Bayesian Statistical Inference for Psychological Research, Psychological Review, 70: 193-242, 1963.
[Efron and Tibshirani, 1993] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall, London, 1993.
[Fisher, 1925] R. A. Fisher. Statistical Methods for Research Workers, Oliver and Boyd, Edinburgh, 1925.
[Fisher, 1935] R. A. Fisher. The Design of Experiments, Oliver and Boyd, Edinburgh, 1935.
[Fisher, 1955] R. A. Fisher. Statistical Methods and Scientific Induction, Journal of the Royal Statistical Society, B, 17: 69-78, 1955.
[Fisher, 1956] R. A. Fisher. Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh, 1956.
[Gigerenzer, 1993] G. Gigerenzer. The Superego, the Ego, and the Id in Statistical Reasoning, pp. 311-39 in Keren, G. and C. Lewis (eds.), A Handbook of Data Analysis in the Behavioral Sciences: Methodological Issues, Erlbaum, Hillsdale, NJ, 1993.
[Godambe and Sprott, 1971] V. Godambe and D. Sprott, eds. Foundations of Statistical Inference, Holt, Rinehart and Winston of Canada, Toronto, 1971.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference, Cambridge University Press, Cambridge, 1965.
[Hacking, 1980] I. Hacking. The Theory of Probable Inference: Neyman, Peirce and Braithwaite, pp. 141-160 in D. H. Mellor (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge University Press, Cambridge, 1980.
[Hull, 1988] D. Hull. Science as a Process: An Evolutionary Account of the Social and Conceptual Development of Science, University of Chicago Press, Chicago, 1988.
[Jeffreys, 1939] H. Jeffreys. Theory of Probability, Oxford University Press, Oxford, 1939.
[Kadane et al., 1999] J. Kadane, M. Schervish, and T. Seidenfeld. Rethinking the Foundations of Statistics, Cambridge University Press, Cambridge, 1999.
[Kass and Wasserman, 1996] R. E. Kass and L. Wasserman. The Selection of Prior Distributions by Formal Rules, Journal of the American Statistical Association, 91: 1343-1370, 1996.
[Kempthorne and Folks, 1971] O. Kempthorne and L. Folks. Probability, Statistics, and Data Analysis, The Iowa State University Press, Ames, IA, 1971.
[Lehmann, 1986] E. L. Lehmann. Testing Statistical Hypotheses, 2nd edition, Wiley, New York, 1986.
[Lehmann, 1993] E. L. Lehmann. The Fisher and Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? Journal of the American Statistical Association, 88: 1242-9, 1993.
[Lindley, 1997] D. Lindley. Some comments on "Non-informative priors do not exist", Journal of Statistical Planning and Inference, 65: 182-189, 1997.
[Mayo, 1983] D. G. Mayo. An Objective Theory of Statistical Testing, Synthese, 57: 297-340, 1983.
[Mayo, 1985] D. G. Mayo. Behavioristic, Evidentialist, and Learning Models of Statistical Testing, Philosophy of Science, 52: 493-516, 1985.
[Mayo, 1996] D. G. Mayo. Error and the Growth of Experimental Knowledge, The University of Chicago Press, Chicago, 1996.
[Mayo, 1997] D. G. Mayo. Severe Tests, Arguing From Error, and Methodological Underdetermination, Philosophical Studies, 86: 243-266, 1997.
[Mayo, 2003] D. G. Mayo. Could Fisher, Jeffreys and Neyman Have Agreed? Commentary on J. Berger's Fisher Address, Statistical Science, 18: 19-24, 2003.
[Mayo, 2005] D. G. Mayo. Evidence as Passing Severe Tests: Highly Probed vs. Highly Proved, pp. 95-127 in Scientific Evidence, P. Achinstein (ed.), Johns Hopkins University Press, 2005.
[Mayo, 2006a] D. G. Mayo. Philosophy of Statistics, pp. 802-815 in S. Sarkar and J. Pfeifer (eds.), Philosophy of Science: An Encyclopedia, London: Routledge, 2006.
[Mayo, 2006b] D. G. Mayo. Critical Rationalism and Its Failure to Withstand Critical Scrutiny, pp. 63-96 in C. Cheyne and J. Worrall (eds.), Rationality and Reality: Conversations with Alan Musgrave, Studies in the History and Philosophy of Science, Springer, Dordrecht, 2006.
[Mayo, 2008] D. G. Mayo. How to Discount Double Counting When It Counts, British Journal for the Philosophy of Science, 59: 857-79, 2008.
[Mayo, 2010a] D. G. Mayo. Learning from Error, Severe Testing, and the Growth of Theoretical Knowledge, pp. 28-57 in Mayo, D. G. and A. Spanos, Error and Inference, Cambridge University Press, Cambridge, 2010.
[Mayo, 2010b] D. G. Mayo. An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle, pp. 305-314 in Mayo, D. G. and A. Spanos, Error and Inference, Cambridge University Press, Cambridge, 2010.
[Mayo, 2010c] D. G. Mayo. Sins of the Epistemic Probabilist: Exchanges with Peter Achinstein, pp. 189-201 in Mayo, D. G. and A. Spanos, Error and Inference, Cambridge University Press, Cambridge, 2010.
[Mayo, 2010d] D. G. Mayo. The Objective Epistemic Epistemologist and the Severe Tester, in Philosophy of Science Matters: The Philosophy of Peter Achinstein, edited by Gregory Morgan, Oxford University Press, 2010.
[Mayo and Kruse, 2001] D. G. Mayo and M. Kruse. Principles of Inference and their Consequences, pp. 381-403 in Foundations of Bayesianism, edited by D. Corfield and J. Williamson, Kluwer Academic Publishers, Netherlands, 2001.
[Mayo and Spanos, 2004] D. G. Mayo and A. Spanos. Methodology in Practice: Statistical Misspecification Testing, Philosophy of Science, 71: 1007-1025, 2004.
[Mayo and Spanos, 2006] D. G. Mayo and A. Spanos. Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction, The British Journal for the Philosophy of Science, 57: 323-357, 2006.
[Mayo and Cox, 2006] D. G. Mayo and D. R. Cox. Frequentist Statistics as a Theory of Inductive Inference, pp. 77-97 in Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics, Beachwood, OH, 2006.
[Mayo and Spanos, 2010] D. G. Mayo and A. Spanos. Error and Inference: Recent Exchanges on the Philosophy of Science, Inductive-Statistical Inference, and Reliable Evidence, Cambridge University Press, Cambridge, 2010.
[Neyman, 1935] J. Neyman. On the Problem of Confidence Intervals, The Annals of Mathematical Statistics, 6: 111-116, 1935.
[Neyman, 1952] J. Neyman. Lectures and Conferences on Mathematical Statistics and Probability, 2nd ed., U.S. Department of Agriculture, Washington, 1952.
[Neyman, 1956] J. Neyman. Note on an Article by Sir Ronald Fisher, Journal of the Royal Statistical Society, Series B (Methodological), 18: 288-294, 1956.
[Neyman, 1957] J. Neyman. Inductive Behavior as a Basic Concept of Philosophy of Science, Revue Inst. Int. De Stat., 25: 7-22, 1957.
[Neyman, 1971] J. Neyman. Foundations of Behavioristic Statistics, pp. 1-19 in Godambe and Sprott, eds., 1971.
[Neyman and Pearson, 1933] J. Neyman and E. S. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society, A, 231: 289-337, 1933.
[Neyman and Pearson, 1967] J. Neyman and E. S. Pearson. Joint Statistical Papers of J. Neyman and E. S. Pearson, University of California Press, Berkeley, 1967.
[Pearson, 1955] E. S. Pearson. Statistical Concepts in Their Relation to Reality, Journal of the Royal Statistical Society, B, 17: 204-207, 1955.
[Pearson, 1962] E. S. Pearson. Some Thoughts on Statistical Inference, Annals of Mathematical Statistics, 33: 394-403, 1962.
[Peirce, 1931-5] C. S. Peirce. Collected Papers, vols. 1-6, edited by C. Hartshorne and P. Weiss, Harvard University Press, Cambridge, MA, 1931-5.
[Poole, 1987] C. Poole. Beyond the Confidence Interval, The American Journal of Public Health, 77: 195-199, 1987.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery, Basic Books, NY, 1959.
[Pratt, 1977] J. W. Pratt. Decisions as Statistical Evidence and Birnbaum's Confidence Concept, Synthese, 36: 59-69, 1977.
[Rosenkrantz, 1977] R. Rosenkrantz. Inference, Method and Decision: Towards a Bayesian Philosophy of Science, Reidel, Dordrecht, 1977.
[Rosenthal and Gaito, 1963] R. Rosenthal and J. Gaito. The Interpretation of Levels of Significance by Psychological Researchers, Journal of Psychology, 55: 33-38, 1963.
[Royall, 1997] R. Royall. Statistical Evidence: a Likelihood Paradigm, Chapman and Hall, London, 1997.
[Savage, 1962] L. Savage, ed. The Foundations of Statistical Inference: A Discussion, Methuen, London, 1962.
[Sober, 2008] E. Sober. Evidence and Evolution: The Logic Behind the Science, Cambridge University Press, Cambridge, 2008.
[Spanos, 1999] A. Spanos. Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press, Cambridge, 1999.
[Spanos, 2000] A. Spanos. Revisiting Data Mining: 'hunting' with or without a license, The Journal of Economic Methodology, 7: 231-264, 2000.
[Spanos, 2006] A. Spanos. Where Do Statistical Models Come From? Revisiting the Problem of Specification, pp. 98-119 in Optimality: The Second Erich L. Lehmann Symposium, edited by J. Rojo, Lecture Notes-Monograph Series, vol. 49, Institute of Mathematical Statistics, Beachwood, OH, 2006.
[Spanos, 2007] A. Spanos. Using the Residuals to Test the Validity of a Statistical Model: Revisiting Sufficiency and Ancillarity, Working Paper, Virginia Tech, 2007.
[Spanos, 2010] A. Spanos. Akaike-type Criteria and the Reliability of Inference: Model Selection vs. Statistical Model Specification, forthcoming, Journal of Econometrics, 2010.
SIGNIFICANCE TESTING

Michael Dickson and Davis Baird

Significance tests are supposed to provide inductive support to conclusions regarding processes that involve some element of chance, due to ignorance, sampling, or inherent probability. We sketch their history and logical structure. While significance tests are widely used in the social and physical sciences, they remain controversial. Indeed, there are serious objections to their use, especially in contexts that are not theoretically well-understood, or for which there are not well-established stochastic models.
1 INTRODUCTION
By definition, a probabilistic process allows for some chance variation in its output. For example, samples taken at random from a population of wage earners may vary with respect to some property (a 'sample statistic'), such as the mean hourly wage. (The probabilistic process in that case is the sampling; the output is the sample; and the property that varies is the mean hourly wage.) Similarly, let the process be 'flipping a coin ten times'; the output (a sequence of results, 'H' or 'T') may then vary with respect to the property 'number of heads'. Suppose that we have an hypothesis about the process, for example, about the distribution of hourly wages within the population, or the chance of getting 'heads' on a flip of the coin. While there are hypotheses that can be conclusively refuted by experiment (e.g., 'the coin always comes up heads'), typical hypotheses about probabilistic processes are not of this sort. We may hypothesize, for example, that the mean hourly wage in the population is $X, or that the coin has probability p (0 < p < 1) to come up 'heads'. Sampling (or flipping) cannot usually refute these hypotheses conclusively. For example, letting p = 3/4, even a trial of the process that produces a string of 10 'tails' in a row does not refute the hypothesis. The most we can hope for is to find strong inductive evidence against the hypothesis. (Of course, intuitively, a string of 10 'tails' is such evidence.) Significance tests are supposed to provide some way of quantifying the strength of such inductive evidence, thus quantifying the inductive 'significance' of the evidence. The central idea is to attach high 'refutational' significance to evidence (obtained as the result of a trial of the process in question) that the hypothesis makes extremely unlikely. Such is the case, for example, for our hypothesis p = 3/4 above, in light of the evidence of 10 'tails' in a row. (The probability of this evidence on the hypothesis p = 3/4 is roughly 1 in a million.) The pitfalls inherent in this form of argument are many, and do not stem merely from its inductive nature, but from the specific form that the argument takes. After
reviewing the history and logic of significance testing more carefully (Section 2), we will consider how one might interpret the results of a significance test (Section 3), and then discuss several objections to it (Section 4). We conclude (Section 5) that under some circumstances, the results of a significance test can be part of a strong inductive argument, but that very often the circumstances in which one wishes to appeal to significance tests are exactly those where these tests should not be compelling.
2 THE DEVELOPMENT AND LOGIC OF SIGNIFICANCE TESTS

2.1 The Normal Distribution and the Z-test
The normal distribution (of a variable, x) is given by the formula

ϕ(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},

where µ is the mean of the variable, and σ its standard deviation. Graphically, the distribution is the familiar 'bell-shaped' curve, centered at x = µ, with width 2σ. By the late 18th century, this distribution was known to be the limit (as the number of trials grows without bound) of a binomial distribution (the distribution of results of trials of a process with just two possible outcomes, one having probability p, the other, therefore, 1 − p). It was later used to develop a general 'theory of errors' of observation, where a 'successful' observation is associated with one value of the binomial variable, and an 'erroneous' observation is associated with the other value. The probability of an erroneous observation is thus given by the probability that characterizes the binomial distribution. (Bear in mind that the normal distribution approximates the binomial distribution only for large samples.) The normal distribution became widely used to describe errors of observation, from astronomical observations to the measurements of the chests of Scottish soldiers [Hacking, 1990]. The ubiquitousness of this 'exponential law of errors' came under scrutiny, and rightly so, but under the appropriate circumstances, it can indeed be used to estimate the likelihood that a given observational result is due to random error. In contemporary terms, the procedure is as follows. First, two conditions must be in place: (a) the mean and standard deviation of the variable of interest in the population must be known (or hypothesized); (b) the variable of interest must be known (or hypothesized) to be normally distributed within the population. We will take a sample from the population, and the question is: How likely is it that any deviation of the sample mean from the population mean is due to random chance? (If (b) fails, or is not known to hold, its (potential) failure can be mitigated to some extent by choosing samples of sufficiently large size, in cases where the sample means will then be approximately normally distributed for large samples — the case of a binomial distribution is an example, as we mentioned
above. Exactly how large the sample needs to be depends on the details of the case.) Given a sample of observations of some variable, x, we then compute the sample mean, x̄. The issue, then, is the extent to which we ought to believe that this sample mean should be 'close' to the population mean. The 'null hypothesis' is the hypothesis that the sample mean is close to the population mean, and that any deviations from the population mean are due to random fluctuation. Does the evidence afforded by the sample speak against the null hypothesis? For a sample size of 1, we may answer this question by computing the 'z-score', the number of standard deviations that the sampled value is away from the population mean:

z = (x̄ − µ)/σ,

where µ and σ refer to the population mean and the population standard deviation, respectively. The assumption that the variable x is normally distributed within the population, with mean µ and standard deviation σ, allows a straightforward calculation of the probabilities for various ranges of z-scores by integrating the normal distribution to determine the 'area under the curve' in the appropriate region. If the resulting probability is not too low, then one is often willing to chalk up any difference between x̄ and µ to random error (rather than, say, some other, systematic, cause). If it is 'too' low (i.e., below some specified value), then one may wish, on those grounds, to reject the null hypothesis. For larger samples, of size n (n > 1), we have to consider not the standard deviation of x, but the standard deviation of the means of samples of size n (still assuming that x has mean µ and standard deviation σ), and what matters then is how far our sample mean is from the population mean as measured in terms of this standard deviation. It turns out that this number is σ/√n for samples of size n. Hence in this case the z-score is

z = (x̄ − µ)/(σ/√n).
We then proceed as above, for this z-score. Although it would not have been described in these terms, the above analysis represents the state of the art by around the turn of the 20th century. It has some obvious, and potentially severe, limitations, most notably the fact that it does not apply to non-normally distributed variables, and it requires one to know the population mean and standard deviation, which are often the objects of investigation, and therefore not given from the start.1 (On the other hand, the simple z-test can be appropriate in some cases. It is widely used, for example, in intelligence testing,
where testers believe that they already know the population mean and standard deviation, and may wish to know whether the mean score of a given group of test takers is 'significantly' different from the mean, i.e., whether the group's deviation from the mean can be 'reasonably' chalked up to chance variation.)

1. An alternative way to think about this test does not require one to know the population mean. One might instead wish to hypothesize the population mean, and then to calculate, based on that hypothesis, the probability that a given sample mean is 'close to' the true mean. If the sample is known to be random with respect to the variable of interest, then this test may help one to verify or reject hypotheses about the population mean.
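A minimal z-test sketch follows; the IQ-style numbers are hypothetical, chosen only to mirror the intelligence-testing example, and the function name is ours:

```python
from math import sqrt, erfc

def z_test(xbar, mu, sigma, n=1):
    """z-score of a sample mean against a hypothesized population mean mu,
    with known population standard deviation sigma; returns (z, two-sided p)."""
    z = (xbar - mu) / (sigma / sqrt(n))
    p = erfc(abs(z) / sqrt(2))   # area in both normal tails beyond |z|
    return z, p

# Population mean 100, standard deviation 15; a group of 25 test takers
# averages 106. Can the deviation be chalked up to chance variation?
z, p = z_test(106, 100, 15, n=25)
print(z, p)   # z = 2.0, p ~ .046: 'significant' at the .05 level
```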
2.2 Pearson and the Chi-Squared Test

2.2.1 Early History and Binomial Tests
While there were precursors (notably Edgeworth's (1885) attempt to develop a way to control for chance variation; cf. [Baird, 1987]), arguably the first modern example of a significance test was Karl Pearson's [1900] chi-squared test. Pearson had a specific problem. In the early 1890s he had come to realize that the normal distribution does not always (or even very often) fit actual chance data. (Pearson makes this observation in numerous places, including several times about data collected by Galton in Pearson [1930], e.g., at p. 74. One modern real-world discussion of the ubiquity of non-normality (in this case, in the context of quality control) is [Pyzdek, 2001].) He therefore developed a system of chance distributions, each obtainable via a variation in the parameters appearing in a 'generalized probability curve', a formula that generalizes the normal distribution. Pearson's assumption — no longer held today — was that chance variation would have to be distributed according to one of the distributions in this system (Baird 1983a). The very existence of alternatives to the normal distribution already raises the question of how well any given data fit a given distribution, and Pearson developed the chi-squared test to solve this problem. While we no longer use Pearson's system of chance distributions, the chi-squared test remains an important test of the goodness of fit between gathered data and a proposed statistical distribution. To see how it works, consider first the simple case of a chance process with just two possible outcomes (for example, flipping a coin). (What follows is not the chi-squared test itself, but a preliminary illustration of the logic of such tests.) Sampling from this process (flipping the coin) N times yields a probability of

[N!/(r!(N − r)!)] p^r (1 − p)^{N−r}

for getting a sequence containing r many 'heads' and N − r many 'tails', where p is the probability of getting a 'heads'. An hypothesis about the (presumed binomial) distribution in this case amounts to a proposed value for p. A subsequent test of this hypothesis may consist in N samples (flips), the result being a sequence with r many 'heads'. We may compute the probability of this result, under our proposed value for p, using the formula above. (Note: we are calculating the probability of getting r many heads, not the probability of a specific sequence.) If the probability of the actual result, thus determined, is exceptionally low, we may be inclined to reject the initial hypothesis. For example, if we propose p = 1/2 and subsequently get 20 heads in 100 flips, we may be inclined to reject our initial proposal, in favor
of one that asserts a much lower probability for heads on a single flip, and thus a much higher probability for getting 20 heads in 100 flips. (The probability of getting 20 heads in 100 flips, given p = 1/2, is of order 10^{−10}.)
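Both probabilities mentioned so far are one-line computations (a sketch; the function name is ours, and the outputs are exact values behind the rough orders quoted in the text):

```python
from math import comb

def binom_pmf(r, n, p):
    """Probability of exactly r 'heads' in n flips when P(heads) = p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binom_pmf(20, 100, 0.5))   # ~4.2e-10: 'of order 10^-10'
print((1 / 4)**10)               # 10 tails when p = 3/4: ~9.5e-7, 'roughly 1 in a million'
```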
2.2.2 The Chi-Squared Test
This 'significance test' applies only in the simple case of a binomially distributed variable, and the calculations involved are feasible (or were in Pearson's time) only for relatively small sample sizes.2 Consider, now, the more general case of a variable with finitely many possible values, and an arbitrary hypothesized distribution, pi, where i indexes the possible values of the variable, and pi is the hypothesized probability that the variable will take the ith value on any given trial. So the null hypothesis in this case is the hypothesis that the distribution of the variable of interest in the population is given by the pi.3 Letting N be the sample size and ni the number of times that the ith value appears in the sample, Pearson defines

χ² = Σ_i (ni − N pi)² / (N pi).
(So N pi is the expected frequency of the ith value, given the hypothesis.) Note that this quantity gets larger as the actual data get further from the predicted result. Hence χ² is a measure, of sorts, of how much the actual data deviate from the prediction. It turns out that, given N and the pi, one can calculate the probability p(χ² ≥ χ²_O), i.e., the probability that a chance process described by the probabilities pi will produce the observed deviation χ²_O, or 'worse' (greater). Much as we might do in the binomial case discussed above, then, we may choose a probability, p, lower than which we are not willing to hold on to the original hypothesis. I.e., if our sample produces a χ²_O for which p(χ² ≥ χ²_O) is lower than p, then we reject the hypothesis. (It is far from clear that Pearson intended the χ²-test to be understood in this way. He does speak in these probabilistic terms occasionally, but seems also to have considered the main point to be to provide a measure of goodness of fit, rather than a probability to be the basis for the acceptance or rejection of hypotheses. However, regardless of Pearson's intentions, the χ²-test is now used in this way. See Baird 1983b.)

2. There are good estimates for larger sample sizes — for example, a binomial distribution is well approximated by a normal distribution for large sample sizes — and these were known in Pearson's time. However, the restriction to binomial distributions is still required, and remains a very severe limitation.

3. Other sorts of hypothesis may be considered. For example, one may wish to hypothesize only the general form of the population distribution (e.g., that it is Poissonian). Here we consider the simple case of hypothesizing a specific distribution, as it is sufficient to examine the logical points that we wish to discuss.
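The test is easy to state in code. The sketch below assumes SciPy is available for the tail probability of the χ² distribution; the die-rolling counts and the function name are hypothetical, supplied only for illustration:

```python
from scipy.stats import chi2

def pearson_chi2(observed, probs):
    """Pearson's chi-squared statistic and the p-value p(chi2 >= chi2_O)
    for a fully specified null distribution probs over finitely many values."""
    N = sum(observed)
    x2 = sum((n_i - N * p_i)**2 / (N * p_i)
             for n_i, p_i in zip(observed, probs))
    df = len(observed) - 1          # degrees of freedom for a simple hypothesis
    return x2, chi2.sf(x2, df)

# A die hypothesized to be fair, rolled 120 times (hypothetical counts):
obs = [12, 25, 28, 19, 21, 15]
x2, p = pearson_chi2(obs, [1/6] * 6)
print(x2, p)   # a small p would tell against the fairness hypothesis
```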
2.2.3 Composite Hypotheses and Independence
Thus far, we have considered so-called 'simple' hypotheses, i.e., hypotheses that propose a specific distribution for the variable of interest. Often one is interested, instead, in 'composite' hypotheses, which do not propose a specific distribution, but a range of distributions. For example, one common hypothesis is that two variables are probabilistically independent of one another. Consider, for example, a case of sampling two binomial variables, X ∈ {0, 1} and Y ∈ {0, 1}. Each sample produces one of four possible results: ⟨0, 0⟩, ⟨0, 1⟩, ⟨1, 0⟩, ⟨1, 1⟩. The problem may thus apparently be treated as sampling a single variable with four possible results. The hypothesis of independence, however, does not specify a single distribution for this 4-valued variable, but rather a family of distributions, in each of which the variables X and Y are probabilistically independent. (Of course, similar remarks hold for the multinomial case.) It is far from clear what the relationship should be between significance tests for each of many simple hypotheses (to which the χ²-test most clearly applies) and inferences regarding the composite hypothesis comprised of these simple hypotheses. One may, on the one hand, wish to suppose that if each of the individual simple hypotheses should be rejected, then so should the composite hypothesis — after all, it is merely a disjunction of the individually rejected simple hypotheses. On the other hand, recall that rejection here is a probabilistic judgement: we reject an hypothesis because it makes the observed data very unlikely. Does the fact that each of several hypotheses makes the observed data unlikely entail that the group as a whole ought to be rejected? It is far from clear that the answer is 'yes' — that answer lies perilously close to the lottery paradox. In some cases — including the important case of hypotheses of independence — this issue can be, at least for the moment, sidestepped. One may use observed data to make estimates of specific probabilities, in accord with the composite hypothesis (for example, the hypothesis of independence). The result is a simple hypothesis that can be tested in the usual way by a χ²-test. A case of historical, theoretical, and practical importance is the 'fourfold contingency table' (nowadays called a '2 × 2' table), where we wish to judge the independence of the two variables X and Y, as described above. We begin with the observed frequencies, described abstractly in Table 1.
         X = 0    X = 1    Total
Y = 0      a        b      a + b
Y = 1      c        d      c + d
Total    a + c    b + d    a + b + c + d

Table 1. Observed data for two binomial variables. The variables a, b, c, d are arbitrary natural numbers.

We may use this data to generate estimates of the four possible results, but in so
doing, we must also respect the hypothesis of independence. Hence, for example, the joint probability p(h0, 0i) must be the product of marginals: a+c a+b p(h0, 0i) = p(X = 0) × p(Y = 0) = . a+b+c+d a+b+c+d The result is a simple hypothesis about the underlying (population) distribution, shown in Table 2. (Note that it is sufficient to calculate just one cell from Table 2, as the others can be determined from the result of that calculation together with the marginal totals.) X=0
X=1
Total
Y =0
(a+c)(a+b) a+b+c+d
(b+d)(a+b) a+b+c+d
a+b
Y =1
(a+c)(c+d) a+b+c+d
(b+d)(c+d) a+b+c+d
c+d
Total
a+c
b+d
a+b+c+d
Table 2. Expected (predicted) data for two independent binomial variables. A standard χ2 -test may then be applied to the observed and hypothesized data in Tables 1 and 2. We discuss this case further below (Section 2.4).
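A minimal sketch of this construction in code may help; the counts a, b, c, d below are illustrative, and SciPy is assumed:

```python
# A minimal sketch of the chi-squared test of independence for a fourfold
# (2 x 2) contingency table, following Tables 1 and 2 above.
from scipy.stats import chi2

a, b, c, d = 30, 20, 10, 40        # observed counts, laid out as in Table 1
N = a + b + c + d

observed = [[a, c], [b, d]]        # rows: X = 0, 1; columns: Y = 0, 1
row_totals = [a + c, b + d]        # marginal totals for X
col_totals = [a + b, c + d]        # marginal totals for Y

# Expected counts under independence (Table 2): row total times column
# total, divided by the grand total.
expected = [[r * col / N for col in col_totals] for r in row_totals]

chi2_O = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))

# One degree of freedom for a 2 x 2 table with estimated marginals
# (see Section 2.4).
print(chi2_O, chi2.sf(chi2_O, df=1))
```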
2.3 Gossett and the t-test
In 1899, William Gossett went to work for the Guinness brewery. One of the problems that he took on there was the selection of the best-yielding varieties of barley. The problem is statistical in nature because one has, to begin, a chance process — the raising of barley (the variable of interest is the yield). In order to learn about the latest methods for estimating population statistics (in this case, yield) from sample statistics (measured yield of a given variety), Gossett spent two terms (1906-07) in Pearson’s statistics laboratory. Pearson’s methods were what Gossett needed, almost. As we noted above, the χ²-test works well only for large sample sizes. However, Gossett was working with relatively small sample sizes, and thus needed to develop a new method. (Pearson had not faced this problem — he was largely concerned with biometrics, where one typically has very large sample sizes.) The result was a new significance test, suitable for small sample sizes: the t-test. Because Guinness was concerned
with its employees revealing trade secrets, Gossett published this work under the pseudonym ‘Student’ [Gossett, 1908], and hence the test is often called “Student’s t-test”. As was common at the time (and remains so), Gossett assumed that the variable of interest is distributed normally in the population. For large sample sizes, this assumption more or less assures one that the sample means taken from the population will be distributed normally, which fact makes the reliable determination of the probability (given an hypothesis) of the actual data straightforward, if one has a good estimate of the standard deviation in the population. However, for small sample sizes, the sample in general affords a relatively poor estimate of the population standard deviation. Gossett’s main accomplishment was to define a statistic whose distribution could be determined exactly, even when the population standard deviation is being estimated from a relatively small sample. This statistic is denoted ‘t’, and is defined by

    t = (x̄ − µ)/s,

where x̄ is the sample mean, s is the sample standard deviation, and µ is the hypothesized population mean. The statistic t measures ‘how many sample standard deviations the observed (sample) mean is away from the hypothesized mean’. Given the hypothesis, larger values of t are increasingly less probable. As we mentioned, Gossett was able to determine (by “guessing” — his own word) a probability distribution for this statistic. Hence, given a hypothesized mean (and the assumption that the population is normally distributed), a sample mean, and a sample standard deviation, one can determine the probability that t is equal to or greater than some observed value for t. The novelty here is that this probability tracks the probability that the hypothesis confers on observed sequences even for small sample sizes. Gossett did not have a mathematically rigorous proof for the distribution of t. In 1912, as a Cambridge undergraduate, R. A. Fisher provided the proof (published as [Fisher, 1915]). However, it was recognized by both Gossett and Fisher that similar results for other statistics were much needed. Fisher had tackled the problem of correlation coefficients, but in his [1922], partly prompted by Gossett (in private communication), he acknowledged that several important cases remained to be resolved for small samples, including multivariate correlation coefficients, regression coefficients, and others. Many of these cases were later solved by Fisher, using (geometric) methods similar to the one he used to derive the t-distribution. Another direction of extension is raised by the question to what extent these various distributions depend on the assumption that the population is normally distributed. Gossett had, in 1923, tried to interest Fisher in the problem of ‘robustness’ (as it would later be called), to no avail. He did the same with E.S. Pearson (son of Karl) three years later. E.S. Pearson and his coworkers began testing the robustness of Gossett’s and Fisher’s results for non-normal distributions
experimentally (using tables of random numbers to perform simulated random sampling from non-normal distributions), ultimately showing which of the results were robust, and which not. The t-distribution is robust, but some of the other distributions derived by Fisher are not. Theoretical work confirming the results of the simulations occurred much later (see, notably, [Box, 1953]). (On the other hand, Fisher was apparently offended by the very idea that the point even needed consideration. There was some friction, as a result, between him and E.S. Pearson. Gossett somehow managed to stay friendly with both of them. See [Lehmann, 1999].)
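As an illustration, here is a minimal sketch of a one-sample t-test in Python (SciPy assumed). Note one divergence from the text: the now-standard convention divides by s/√n rather than by s alone. The sample and the hypothesized mean µ are illustrative values, not Gossett’s barley data:

```python
# A minimal sketch of a one-sample t-test under the modern convention
# t = (x_bar - mu) / (s / sqrt(n)).  The data are illustrative.
import math
from scipy.stats import t as t_dist

sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]   # a small sample, as Gossett faced
mu = 5.0                                  # hypothesized population mean
n = len(sample)

x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
t_obs = (x_bar - mu) / (s / math.sqrt(n))

# Probability, under the hypothesis, of a t at least this extreme
# (two-tailed), from Student's t distribution with n - 1 degrees of freedom.
p_value = 2 * t_dist.sf(abs(t_obs), df=n - 1)
print(t_obs, p_value)
```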
2.4 Fisher’s Degrees of Freedom
Pearson’s χ²-test is used in a variety of circumstances. In some cases, theory provides one with a well-defined hypothesized distribution, and in such cases, all agree that the test should proceed as we describe above. However, frequently one uses the data itself to provide estimates for the parameters of an hypothesized type of distribution (in particular, the probabilities or frequencies for the various values of the variable of interest). Indeed, Pearson himself pioneered methods for estimating the parameters of an hypothesized distribution from the observed data. Above (Section 2.2.3), we saw a case where a composite hypothesis was convertible to a simple hypothesis susceptible to a χ²-test by estimating frequencies. In such cases, it is unclear exactly how many parameters of the distribution are being tested for ‘goodness of fit’. For example, one of the probabilities that is being estimated is fixed by the estimates for the others, because the probabilities of all possible values must sum to 1. A specific example is both historically important and illustrative. Yule and Greenwood [1915] came across this issue in an analysis of the effectiveness of cholera vaccination. The (null) hypothesis of interest here is the composite hypothesis that the ‘vaccinated’ variable is independent of the ‘contracts disease’ variable. As discussed in Section 2.2.3, in order to apply the χ²-test in its usual form, they had to estimate the relevant probabilities, using the data that they had collected (rather than consulting some theory that is independent of that data). Following essentially the procedure illustrated in Tables 1 and 2, they then employed the χ²-test, using the distribution of χ² for a 4-valued variable. They also performed a more traditional statistical analysis (a ‘standard error’ analysis) under the assumption that the populations are distributed normally (an assumption not, of course, required by the χ²-test). The results differ. While there was already in the literature (largely due to Pearson) ample reason to be suspicious of the assumption of normality, this discrepancy remained a thorn in the side of the χ²-test as described by Pearson. Yule and Greenwood themselves observed that the standard error analysis agreed with the χ²-test under the assumption that the variable of interest has only 2 (rather than 4) possible values, and they proposed an explanation for the discrepancy, which was later given a stronger mathematical foundation by Fisher, who was able to derive an exact
distribution for χ² even under the circumstance that the parameters describing the distribution of the variable of interest in the population are estimated from the data, rather than known, or given by theory. Fisher’s result is thus similar to Gossett’s (which demonstrates an exact distribution for t in the population, even though t contains an estimate of the population standard deviation). It turns out that in the case considered by Yule and Greenwood, the distribution of χ², when the population (specifically, frequency) parameters are estimated from the data, is given by Pearson’s distribution for χ² under the assumption of a two-valued, rather than a four-valued, variable of interest. Fisher’s result relies on the notion of ‘degrees of freedom’. Recall that in the case of the fourfold contingency table, fixing one cell of the table (by assuming independence) fixes the other three cells. In other words, the hypothesis of independence (plus the use of observed data to estimate frequencies) determines the probability for one value of the (composite) variable under investigation, and simple mathematics fixes the others. This fact strongly suggests that we should take ourselves to be estimating a single binomial parameter describing the compound distribution of the pair of variables, rather than the four parameters illustrated in the table, or the two binomial parameters that gave rise to the inquiry in the first place. Fisher introduced the term ‘degrees of freedom’ to describe the number of parameters required to describe all possible variations from the hypothesis under examination. He further argued that when calculating the probability that a given deviation from the hypothesis is due to chance variation, the proper distribution to use for χ² is the one obtained by assuming that the number of parameters is equal to the degrees of freedom. Debate and controversy ensued. Pearson was apparently concerned more with measures of goodness of fit, while Fisher was concerned with the probability that an hypothesis confers on an observed statistic. (See [Baird, 1983b].) Pearson thus never acknowledged that Fisher’s ‘degrees of freedom’ is a correct way to choose a distribution for χ². Given his understanding of the meaning of χ² and its ‘distributions’, he may have been correct, but the statistical tradition has clearly followed Fisher on this point.
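The practical difference degrees of freedom make can be seen in a minimal sketch (Python, SciPy assumed; the observed χ² value is illustrative):

```python
# The same observed chi-squared value receives very different p-values
# under Pearson's count (three degrees of freedom for a four-valued
# variable) and Fisher's (one, for a 2 x 2 table with estimated marginals).
from scipy.stats import chi2

chi2_O = 4.2                   # illustrative observed value
print(chi2.sf(chi2_O, df=3))   # about 0.24: not significant
print(chi2.sf(chi2_O, df=1))   # about 0.04: significant at the 5% level
```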
2.5 Complications in the Application of Significance Tests

2.5.1 Data ‘Too Good to be True’
In the early 20th Century, Gregor Mendel’s [1865] work on plant hybridization was ‘rediscovered’, and indeed Mendelian genetics was at the center of ongoing debates between self-proclaimed ‘Darwinists’ and ‘Mendelians’. (Franklin et al. [2008] provide an exhaustive review and analysis of the story to come.) It was therefore natural for Fisher — who hoped and argued for a reconciliation between ‘Mendelians’ and the ‘Darwinists’ — to examine Mendel’s work closely, from a statistical point of view. While taking quite a positive view of Mendel’s work overall, Fisher was critical of a few aspects of it, the most important, for our purposes, being the ‘goodness
of fit’ between Mendel’s reported observations and his theoretical prediction. The experiments in question were aimed at determining the ratio of heterozygous to homozygous individuals (in this case, pea plants) exhibiting a given dominant trait. (We simplify matters here to get to the basic point — see [Franklin et al., 2008] for the gory details, which are in fact quite important to the ultimate judgment one might make about Mendel’s work, but not to the point we wish to make here.) For each individual, Mendel and his associates examined 10 of its progeny for the recessive trait, and when at least one exhibited the recessive trait, they judged the parent to be heterozygous. Mendel’s theory predicts that the ratio of heterozygous to homozygous individuals is 2:1. In one of his experiments, Mendel reports, out of 600 plants, 399 heterozygotes and 201 homozygotes. This data is, of course, extremely close to the theoretically predicted frequencies of 400 and 200. Indeed, Fisher concludes, the data are too good to be true. The value of χ² in this case is a suspiciously low 7.5 × 10⁻³. The probability of getting data this good, if Mendel’s underlying probabilistic (binomial) model is correct, is very low, in the sense that if Mendel’s model is correct, then the probability of finding such a low value for χ² is almost zero. The issue is further complicated by the fact, observed by Fisher, that if the true ratio in the population is 2:1, then Mendel’s experimental methodology would yield an observed ratio of approximately 1.7:1, as some heterozygous parents will go undetected, because their offspring will all, by chance, have the dominant trait. By Fisher’s lights, this observation makes matters worse for Mendel. He calculated χ² under the assumption that the expected observed ratio is 1.7:1 (which, again, is what it would be if the 2:1 law is correct), and finds the value to be unacceptably high. In other words, Mendel’s data does not, apparently, fit this expected result very well. Prima facie, it looks very much like Mendel’s data were cooked. Assuming Mendel’s 2:1 law (which is not in doubt, here), we would expect the observed results to be close (but not too close!) to the 1.7:1 ratio. Mendel, however, expected to observe a 2:1 ratio. His data are, in fact, extremely close (too close?) to this 2:1 expectation. It looks for all the world as if the ‘uncooked’ data were somewhat close to the 1.7:1 ratio, then ‘cooked’ to make them very close to the 2:1 ratio. This conclusion is, in fact, the one that Fisher drew (though not being able to malign Mendel, whose work Fisher admired, he speculated that the blame lay with an over-zealous assistant). It is not uncommon to find this conclusion in the literature today as well, though the matter remains controversial.
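The single-experiment figures quoted above can be reproduced in a minimal sketch (Python, SciPy assumed). Bear in mind that Fisher’s own analysis aggregated χ² across all of Mendel’s experiments; this sketch covers only the one experiment reported in the text:

```python
# Fisher's 'too good to be true' computation for the reported experiment:
# 399 heterozygotes and 201 homozygotes against a predicted 2:1 ratio.
from scipy.stats import chi2

observed = [399, 201]          # reported counts, out of 600 plants
expected = [400.0, 200.0]      # predicted by the 2:1 law

chi2_O = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2_O)                  # 0.0075, i.e. 7.5e-3, as in the text

# For a fit that is 'too good', the relevant probability is that of a
# chi-squared value this small or smaller (the left tail), with df = 1.
print(chi2.cdf(chi2_O, df=1))  # about 0.07 for this single experiment
```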
2.5.2 Some Illustrative Issues With Fisher’s Analysis
The story does not end here. There are several issues remaining to consider. Here we mention three, and refer the reader to [Seidenfeld, 2008] for an extensive discussion. First, much depends on how one organizes the data into ‘samples’. Consider: suppose a fair coin comes up ‘heads’ exactly 500 times in 1000 independent flips.
This outcome is rare. However, any specific outcome (i.e., any specific number of ‘heads’) is rare in this case, and it is far from clear that we should throw out the hypothesis of fairness based on a single sample, unless the sample suggests a strong bias (either for or against ‘heads’). But suppose we did 10 samples of 100 flips each, and got exactly 50 ‘heads’ each time. Now matters are beginning to look fishy. While exactly 50% ‘heads’ may not be cause for suspicion in a single sample, its recurrence 10 times should raise eyebrows. The problem here is that there are no compelling principles guiding the individuation of samples (of a process whose trials are independent); but if arbitrary divisions of (independent) trials into multiple samples are permitted, then we are in trouble, because we could easily divide the 1000 flips into samples in such a way that the individual samples have all sorts of unlikely statistical properties, such as nearly always alternating ‘heads’ with ‘tails’.

Second, in order for a significance test to provide decent evidence against a given null hypothesis, it is at least helpful, if not compulsory, to know which other hypotheses are in play, and whether they have less troublesome values for χ², and whether they are plausible (or implausible) for other reasons. For suppose that a given hypothesis, H, has a particularly improbable value for χ², but that all other hypotheses that we are willing to consider also have improbable values for χ². Then there is little reason to reject any of them on the grounds of its value for χ². However, the real problem here is not whether such hypotheses (predicting a ‘reasonable’ value for χ²) exist (generally, mathematically, they do), but whether there is any one hypothesis that stands out as the alternative to the null hypothesis. In general, there is not. In the case of Mendel’s peas, one alternative is that the data were ‘cooked’, but another is that the trials were not, after all, independent and identically distributed (i.i.d.). (Seidenfeld [2008], for example, argues that in fact the trials may not have been i.i.d.) Of course, one may take as the ‘alternative’ the bare denial of the null hypothesis (as perhaps Fisher would have one do), but in that case, one must be extremely modest about what is being asserted at the end — one cannot, for example, claim to have shown that the data were ‘cooked’ (as opposed to, for example, coming from trials that were not i.i.d.).

Third, it is not clear why one must choose the χ² statistic to determine the likelihood that the hypothesis confers on the data. χ², like any other statistic, is, mathematically, just a number computed from the observed data. But any outcome of an experiment can be made to appear ‘rare’ when it is redescribed in terms of a (‘suitably’ chosen) statistic. An extreme case here would (in many cases) be to use as one’s p-value the probability that the hypothesis confers on the data itself. (Never mind for the moment that this number is often very difficult to calculate, which is partly why we opted for statistics in the first place.) In many cases this choice will yield extremely low p-values for any observed data (which, again, is partly why we opted to calculate the probability of a statistic). For example, the hypothesis that a coin is fair confers exactly the same (extraordinarily low) probability on every N = 100 sequence of ‘heads’ and ‘tails’. Hence we are
driven to consider the probability of certain ‘well-chosen’ statistics computed from the data. However, any choice of a statistic introduces additional assumptions into the discussion, and in the end a discrepancy ‘too unlikely to occur by chance’ between the data and an hypothesis may be chalked up to these assumptions. For example, the appropriateness of measuring the ‘likelihood’ of a given sequence of heads and tails by considering the number of heads it contains depends on whether the trials are i.i.d. If the trials are i.i.d., then we would assign very low probability to a sequence of all heads of any appreciable length (assuming that the probability of ‘heads’ on any given trial is 1/2). If they are not — for example (to take the extreme case), if trials 2 and beyond are guaranteed to match the result of trial 1, then the sequence of all heads would have quite a high probability (1/2, if the probability of ‘heads’ on any given trial is 1/2). Our purpose here is not to resolve these issues, but to raise them. The choices that one makes in designing a significance test can have a significant effect on the result.
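The arithmetic behind the first issue above is easily checked with a minimal sketch (Python; the sample division is the one from the text):

```python
# Exactly 50 heads in a sample of 100 fair flips is unremarkable on its
# own, but its recurrence in ten independent samples is not.
from math import comb

p_exact_50 = comb(100, 50) / 2 ** 100
print(p_exact_50)           # about 0.08: common enough in a single sample
print(p_exact_50 ** 10)     # about 1e-11: astonishing in ten samples
```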
2.5.3 Two Types of Test
Recall that the χ²-test measures, roughly, ‘how well the observed data fit the hypothesized distribution’. It is insensitive to whether too many of the observed values are ‘too high’ or ‘too low’, counting both cases as ‘misfits’. In general, a test is ‘two-tailed’ when the null hypothesis will be rejected if the value of the test statistic (such as a sample mean) is either too small or too large. Otherwise, the test is ‘one-tailed’. (The terms ‘one-sided’ and ‘two-sided’ are sometimes used, especially when the underlying population distribution is known or suspected to be poorly described as having ‘tails’.) Whether a significance test should be one-tailed or two-tailed is not always clear — it often depends on the ultimate aim of the testing. For example, one-tailed tests are often appropriate in cases where one is interested in whether a given intervention has a positive effect on some condition, for example, in medical testing. The point is that one rejects the null hypothesis of chance variation only when the evidence points in favor of the intervention’s improving the condition in question. One will, in that case, fail to reject the null hypothesis even when the evidence happens to point in favor of the intervention’s having a negative effect, but if the plan is to adopt the intervention only if the null hypothesis is rejected, then arguably this failure to detect negative effects is acceptable. However, notice that the actual (numerical) significance levels mean different things in the two cases. In a one-tailed test, failing the test at the 5% level means that the sample statistic was in the top (alternatively, bottom) 5% of values within those predicted by the model. In a two-tailed test, failure means that the sample statistic was either in the top 2.5% or the bottom 2.5% of values within those predicted by the model. This fact can lead to misinterpretation. Consider again the medical testing example, and suppose that one researcher uses a one-tailed test, while the other
uses a two-tailed test. They both find that the test statistic falls at the upper 4% point of the values predicted by the null hypothesis. The researcher using the one-tailed test will reject the null hypothesis in this case (the one-tailed p-value is 4%), and conclude that the intervention has a positive effect, while the researcher using the two-tailed test will not reject it (the corresponding two-tailed p-value is 8%). (See [Bland and Bland, 1994] for a medical example along these lines.)
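A minimal sketch of this contrast, for a test statistic with a standard normal null distribution (Python, SciPy assumed; the observed value is illustrative):

```python
# One-tailed versus two-tailed p-values for the same observed statistic.
from scipy.stats import norm

z_obs = 1.75                              # observed standardized statistic

p_one_tailed = norm.sf(z_obs)             # upper tail only
p_two_tailed = 2 * norm.sf(abs(z_obs))    # both tails

print(p_one_tailed)   # about 0.04: rejected at the 5% level one-tailed ...
print(p_two_tailed)   # about 0.08: ... but not two-tailed
```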
2.5.4 A Common Logic
We pause, here, to notice that significance tests, as we have described them and as they are commonly employed today, are united by a common logic (see [Baird, 1987]). Once a probabilistic process has been identified, significance testing involves four steps. The first is the stipulation of a ‘null hypothesis’, which, generically, states that the probabilistic nature of the process itself is sufficient to account for the results of any sample of the process. In other words, nothing other than the variation that occurs by chance need be invoked to explain the results of any given trial, or the variation in results from one trial to another. Of course, ‘by chance’ must be carefully defined, by means of some probabilistic model of the process, one that specifies the probabilities that underlie the results. ‘Significance’ will then be assessed relative to this model, and it is this model, and only this model, that is the target of inductive conclusions based on a significance test. (To illustrate: Seidenfeld’s [2008] concern about whether Mendel’s trials are i.i.d. is a concern about Fisher’s model, not about his analysis of significance relative to the model that he (and, apparently, Mendel) presumed). For example, the coin-flipping process mentioned above may be modeled as a binomial distribution defined by the parameters p (the probability of getting a heads), N (the number of flips), and the probabilistic independence of the flips. Sampling N hourly wages from a population of workers may be modeled, semiformally, as the random drawing of N balls from an urn containing a given proportion, pn , of balls for each possible hourly wage, wn . The null hypothesis specifies values for the parameters of these models (for example, p = 3/4 in the coin flipping process), in virtue of which it specifies a distribution for some variable or variables of interest (for example, number of ‘heads’, or hourly wage). The second step in significance testing is to find a statistic that appropriately summarizes the results of trials from the process in question. Appropriateness comes down to two things: (1) It must rank all the possible outcomes of the process with regard to how adequately the chance model accounts for their occurrence; (2) the model provided by the null hypothesis must provide well-defined probabilities that order the possible outcomes in the same way. The first demand ensures that evidence collected from trials of the process will bear on the adequacy of the model provided by the null hypothesis. The second demand permits one to compare results from different significance tests meaningfully. Clearly, if these two demands are met by the test statistic, then the outcomes of the process for which the model provided by the null hypothesis cannot readily account will tend to have
low probability. For example, in the coin-flipping case, we can choose ‘frequency of heads’ as our test statistic. The binomial model for which p = 3/4, N = 10, and the flips are independent then supplies probabilities for all possible results. These probabilities rank a result lower the further it is from 7.5 heads. In the case of sampling hourly wages, a natural choice of sample statistic would be the mean hourly wage. A model that specifies a distribution (in the total population) of wages will, by virtue of this distribution, generate probabilities for all possible sample means, and these probabilities will tend to be lower the further the sample mean is away from the hypothesized population mean, though it is crucial to bear in mind that the details of the relationship between the probability of a given sample mean and its distance from the population mean may be highly dependent on the shape of the distribution. We return to this point below. The third step in significance testing is establishing a level of significance. A trial of the process is performed, and the test statistic is calculated. The level of significance of the outcome is the probability (as determined by the model given by the null hypothesis) of any result as bad or worse than the observed result (understood in terms of the ranking provided by the test statistic). The smaller this probability, the stronger the evidence against the null hypothesis. This probability is often referred to as the ‘p-value’ of a test. Finally, the fourth step is to reject, or fail to reject, the null hypothesis. When the observed result of the trial tells significantly against the null hypothesis, then the null hypothesis is rejected. Logically, then, significance testing, in its most conservative form, appears as a kind of ‘inductive modus tollens’. The null hypothesis (the ‘antecedent’) implies a probability distribution, p : R → [0, 1], for some test statistic. If the actual value, r, of the test statistic, together with all values, ∆_r, ‘worse than’ it, has ‘sufficiently low’ probability (i.e., if it is lower than some stipulated threshold, τ) according to this distribution, then the null hypothesis is rejected:

    Null Hypothesis ⇒ p : R → [0, 1]
    p(∆_r) < τ
    ∴ ¬Null Hypothesis

We forego comments on the strength of this argument until Section 3. For now, two further observations are crucial to bear in mind. First, as we noted above, it is common, and probably almost always correct, to consider ‘all values worse than r’, rather than just r itself. Otherwise, we could easily falsely reject a null hypothesis based on very good evidence in its favor. Recall the example above: getting exactly 500 ‘heads’ in 1000 flips of a fair coin presumably is not evidence against the coin’s being fair. (On the other hand, getting exactly 500 ‘heads’ in 1000 trials repeatedly may raise questions not about the parameters of the hypothesized model, but about its basic form, for example,
the assumption of i.i.d. trials — see Section 2.5.2 above.) The point is that the exact probability of a given single value of the test statistic is a function of how finely the outcome space has been divided. To avoid this misunderstanding of inductive evidence, significance tests look to the class of results as bad or worse than that obtained. Second, if we fail to reject the null hypothesis, we ought not, logically, conclude anything whatsoever about the null hypothesis itself, except that it was not rejected by the given trial. The analogy with modus tollens is apt in this context: just as one would not conclude P from P → Q and Q, so also one ought not conclude that the null hypothesis is true in cases where the disagreement between a trial and the null hypothesis is determined to be ‘statistically insignificant’. We return to this point below.
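The four-step logic can be made concrete in a minimal sketch (Python, SciPy assumed), using the coin model with p = 3/4 and N = 10 from above; the observed count and the threshold τ are illustrative:

```python
# The four steps: stipulate a chance model, choose a test statistic,
# compute the probability of results 'as bad or worse', then reject or
# fail to reject.
from scipy.stats import binom

p, N = 0.75, 10          # step 1: the chance model given by the null hypothesis
observed_heads = 4       # the result of a trial
tau = 0.05               # stipulated threshold

# Steps 2 and 3: rank outcomes by the probability the model assigns them,
# and sum the probabilities of all outcomes 'as bad or worse' than (i.e.,
# no more probable than) the one observed.
probs = [binom.pmf(k, N, p) for k in range(N + 1)]
p_value = sum(q for q in probs if q <= probs[observed_heads])

# Step 4: reject, or fail to reject.  Here p_value is about 0.02 < 0.05.
print(p_value, "reject" if p_value < tau else "fail to reject")
```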
2.6 Subsequent Developments

2.6.1 Neyman and Pearson’s Approach
In Fisher’s approach to significance testing, statistics were chosen largely on the grounds that an exact population distribution could be calculated (from the available data) for them, at least under given assumptions (about the population). But apart from practical considerations, why choose one statistic rather than another? This question was addressed by Jerzy Neyman and E.S. Pearson, initially jointly [Neyman and Pearson, 1933], and later separately (e.g., [Neyman, 1938; 1955; Pearson, 1955; 1962]). To do so, they consider not merely a null hypothesis, but a class of alternatives to it, and consider, amongst all of these hypotheses, two probabilities: the probability of falsely rejecting the null hypothesis (‘Error I’) and the probability of falsely accepting the null hypothesis (‘Error II’). Given a fixed probability for Error I, they defined the ‘best’ test as the one that minimizes the probability of Error II. Neyman and Pearson were able to solve this problem (i.e., to determine which test is best by this definition) for simple cases, and later work extended their results in various ways to more complex cases. As Fisher would point out in his relentless opposition to these ideas, the calculation of the probability of Error I is based on the null hypothesis, without consideration of any alternative hypotheses. However, the calculation of the probability of Error II requires the specification of a class of alternative hypotheses. On Fisher’s view, this fact rendered these probabilities unable to support the acceptance of an hypothesis in light of a test. Indeed, the issue of just what statistical tests are supposed to achieve seems to have been at the center of the controversy between Fisher and Neyman and Pearson. Neyman and Pearson often spoke in terms of ‘acceptance’ and ‘rejection’ of hypotheses, while Fisher argued that statistical tests are meant only to support rejection: “[the calculation of probabilities of Error II] would never have come into consideration in the theory only of tests of significance, had the logic of such tests not been confused with that of acceptance procedures” [Fisher, 1947, 17]. This
difference was formulated by the protagonists in terms of ‘inductive inference’ (Fisher) versus ‘inductive behavior’ (Neyman and Pearson). Fisher’s view was that statistical tests support inductive inferences; and because the probability of Error II is always relative to a chosen class of hypotheses, statistical tests can never support acceptance tout court, but only a kind of ‘conditional’ acceptance that Fisher thought pointless. Neyman’s view was that statistical tests support inductive behavior, and as behavior requires one to accept some hypothesis or other (one must behave as if something were the case), we do the best we can based on our statistical tests, and argue that in the long run we’ll come out alright, probably. (Pearson shied away from this issue, though the phrase does appear even in their early joint work. For the view that ‘inference versus behavior’ was not, however, at the heart of the dispute, and that Pearson therefore remained ‘true’ to the original motivation for the approach, see [Mayo, 1992].) The story is complicated further by what appear to have been fundamental differences in their understandings of probability, modeling, and the relationship between statistical judgements and action, among other things. (See [Hacking, 1965; Kyburg, 1974; Seidenfeld, 1979; Gigerenzer et al., 1989; Lenhard, 2006]. For the view that the differences between them are largely rhetorical rather than substantial, see [Lehmann, 1993].)
2.7 The Institutionalization of Significance Tests
Statistical significance has become the gold standard in many academic disciplines. (Gigerenzer [1993] tells the story in the case of psychology.) Many major journals in social science, for example, require — either officially or in practice — that publishable studies demonstrate a statistically significant effect (i.e., the data must depart ‘significantly’ from some specified null hypothesis, for some statistic). Often, the level of significance (commonly, 5% or 1%) is dictated as well. This transformation of social science (and it has spread to other data-intensive disciplines as well) occurred rapidly. By one account, it was complete by 1940 or so: “[statisticians] have already overrun every branch of science with a rapidity of conquest rivaled only by Attila, Mohammed, and the Colorado beetle” [Kendall, 1942, 69]. In the light of vocal critics (see Section 4 below), the issue of whether and to what extent to continue these requirements has been taken up by some major academic societies, such as the American Psychological Association (see [Azar, 1996]), although significance tests continue to play much the same role as they had throughout at least the latter half of the 20th century.
3 (MIS)INTERPRETING SIGNIFICANCE TESTS
As has been frequently noted in the literature, there are numerous ways to misinterpret the results of a significance test. (Our discussion will only scratch the surface. See, for example, [Hacking, 1965; Kyburg, 1974; Seidenfeld, 1979] for
some classic and extensive discussions.) Carver [1978], for example, lists three common ‘fantasies’ about the p-value delivered by significance tests:

1. p is the probability that the result of the trial (sample) is due to chance. A small value for p indicates that the result is not due to chance, while a relatively large value for p indicates that the result is due to ‘chance alone’.

2. 1 − p is the reliability of the result of the trial (sample). More precisely, it conveys information about replicability: it is the probability of getting the same result if the experiment were repeated. Researchers may claim that a significant difference (between the model provided by the null hypothesis and the observed result) is replicable if 1 − p is large.

3. p is the probability that the null hypothesis is true.

None of these interpretations is correct, though research would be far easier if they were (hence the term ‘fantasy’). Indeed, researchers are often seduced into adopting one of these interpretations. Numerous examples have been noted by various critics of significance testing. (See, e.g., [Gigerenzer, 1993; Cohen, 1994; Falk and Greenbaum, 1995] for several citations and discussions of such misunderstandings. Cf. [Nickerson, 2000] for an extensive discussion.) It is instructive to consider why these interpretations are incorrect. We take them each in turn.

1. Strictly, p is the probability of the result as given by the model provided by the null hypothesis. Even in cases where the null hypothesis is the claim that the process in question operates randomly according to some distribution, is this p the probability that the result is due to chance? No! Consider the case of flipping a coin 10 times. Suppose we get 6 heads in 10 flips, and the null hypothesis is that the process is binomially distributed with probability 0.5. The null hypothesis thus assigns a probability of roughly 0.75 to getting either 6 or more, or 4 or fewer, heads. (Hence by any reasonable measure of significance, this result does not tell against the null hypothesis.) But is this number (75%) the probability that our result is due to chance? Certainly not. Indeed, it is hard to know how one could ever go about calculating such a probability. What, for example, is the probability that a covert agent, knowing about our coin-flipping experiment, controls the flips and has arranged in advance to make the numbers come out ‘right’? What is the probability that the coin is weighted slightly in favor of heads? Or significantly in favor of heads? The fact that the null hypothesis assigned probability 0.75 to the actual result simply does not bear on these other questions.

2. Neither is 1 − p the reliability of the result. If the null hypothesis is true, then p does in fact represent the probability that we will get the same result (with respect to the test statistic), should we perform another trial. But of
course we do not know that the null hypothesis is true, even if p is small. We know only that it has not (yet?) been rejected. We could obtain the result we did (or any string of such results) for reasons that have nothing to do with the truth of the null hypothesis.

3. It is very seductive to think that p is the probability that the null hypothesis is true. Doing so appeals to our desire to make a decision, to come to a conclusion, even if probabilistic, about the null hypothesis (see, e.g., [Gigerenzer, 1993]). But it is also wrong. There is a difference between Bayesian updating (which should after all give the posterior probability of the hypothesis regardless of one’s commitment to Bayesianism writ large) and significance testing — see [Berger and Sellke, 1987] for an example in which after updating on the new evidence (from a random sample), the hypothesis still has probability > 0.5, and yet a standard significance test would reject the hypothesis ‘at the 5% level’. The failure of this interpretation also illustrates why the others are wrong — for example, if high p indicated that the null hypothesis is true, then it would also be the probability of getting the same result on a second trial.

In addition to avoiding these interpretive pitfalls, there are other issues to consider. Here we mention two.

1. Subjectivists and relative frequentists interpret the probabilities generated by significance tests differently. Indeed, whatever the pitfalls of subjectivism may be, one advantage (ironically, as significance testing is often associated with frequentism) is that the subjectivist is less likely to make the errors mentioned above. There already is a subjectivist account of the posterior probability of any hypothesis, subsequent to the collection of new evidence, namely, Bayesian updating. (Of course, this observation raises the ugly issue of the status of subjectivist Bayesian priors, but we ignore that issue here.) On the other hand, it is not entirely clear what the relative frequentist ought to say about the probabilities produced by a significance test in at least some cases. The relative frequentist often must appeal to some idealized, infinite ensemble, or an idealized infinite sequence of trials, in order to make sense of probabilities as long-run limiting relative frequencies. There are (at least) two well-known and much discussed problems with this view: the actual relative frequency in a finite ensemble or sequence can differ arbitrarily from the limiting relative frequency; and even given an infinite ensemble or sequence, there is no guarantee that the required limits exist. Both of these issues raise the question of what grounds the probabilities in this approach — why believe that observed frequencies have anything to do with ‘probabilities’, or indeed that probabilities, understood in this way, exist at all? (One answer to these questions requires the appeal to an applicable theory or model that supplies probabilities, and is supposed to govern the physical processes under examination.)
These issues are important in the context of significance testing. How ought a relative frequentist interpret the probability that the model provided by a null hypothesis assigns to a given sample? One may be tempted to say that these probabilities represent the limiting relative frequency of infinitely many samples, or trials of the process, but very often this interpretation is meaningless (not to mention its other problems). For example, suppose that the process of interest is sampling from some finite population (say, all wage earners in the United States). There are only finitely many possible samples, in that case. It is not clear what sense there is in talking about infinitely many samples in this case, since long ‘before’ we ever had infinitely many samples in hand, we would have sampled the entire population in every possible way. Again, one answer to these questions is to say that we are in fact sampling from some theoretical population that is infinite, and whose statistical properties are ultimately given by some theory. For example, suppose we are interested in the mean velocity of randomly selected sets of fundamental particles. The actual number of particles in the universe is (let us suppose) finite, and thus (at some appropriate level of coarse-graining) there are only finitely many possible samples of these particles. But we may imagine an infinite ensemble of particles in which the velocity is distributed according to some physical law. If the actual particles in the universe are in fact governed by this law, then one might interpret long-run relative frequencies in terms of an imagined, or purely theoretical, infinite sequence of samples from the theoretical infinite ensemble. In other words, with a well-established (or at any rate, accepted) theory in place, it may be possible for the relative frequentist to make sense of the probabilities of a significance test in these theoretical terms. This does not settle the issue of what conclusions, if any, to draw from the test, but it helps with the issue of interpretation.

2. The issue of causation poses a different collection of issues. Does a small value for p indicate a causal relationship of some sort? For example, in the case of intelligence testing (Section 2.2), if z is large enough (hence p small enough), ought we conclude that something about the manner in which the sample was selected is causally responsible for the significant difference between the score in the sample and the score in the population? Suffice it to say that the issue of how and when to draw causal inferences from statistical data is thorny, and hotly debated.⁴ At the very least, we must acknowledge that it will be extremely difficult — perhaps impossible without the appeal to some theory that involves the variable of interest — to identify a specific causal relationship on the basis of a significant difference between a sample and a population, or between two populations. Even the assumption that a significant difference must be due to some (perhaps unknown) causal difference is more an article of faith than a well-established principle. (The literature on causal inference in a probabilistic context is vast. Some recent monographs include [Humphreys, 1989; Eells, 1991; Spirtes et al., 2000; Pearl, 2000].) A significance test can provide inductive evidence that a certain model — typically, the ‘random chance’ model spelled out by the null hypothesis — is not a good explanation for the data at hand. However, this result does not allow one to distinguish between the various alternatives, such as that the data are the result of some causal process, or some different chance model.

Footnote 4: There is a tradition amongst at least some statisticians of setting it aside completely. Box is often quoted in this context — for example: “Essentially, all models are wrong, but some are useful” [Box and Draper, 1987, 424].
4 OBJECTIONS TO SIGNIFICANCE TESTS
There are numerous objections to significance testing in the literature, though surprisingly, the controversial nature of significance testing rarely shows up in textbooks. (For example, Gliner, Leech and Morgan [2002] examined how significance testing was treated in a variety of textbooks in the social sciences and found the treatments to be uniformly uncritical.) Objections have ranged from cautionary notes (“caution is necessary”, [Mohr, 1990, 74]) to warnings of dangerous irrelevance (“Statistical significance is... a diversion from the proper objects of scientific study”, [McCloskey and Ziliak, 2008a, 2]) to outright ridicule (“surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”, [Rozeboom, 1997, 335]). (Some further notable critiques include [Berkson, 1938; Bakan, 1966; Meehl, 1967; 1990; Carver, 1978; Folger, 1989; Shaver, 1993; Thompson, 1993; Cohen, 1994; Krantz, 1999; Altman, 2004; Ziliak and McCloskey, 2004]. There are many others, including several papers found in the edited collection [Morrison and Henkel, 1970].) We cannot possibly discuss the full range of critiques, as they have been the subject of full-length books. We focus here on some common or important problems and refer the reader to the extensive literature only hinted at in the citations above for full discussion.
4.1 The Logic Fails
While it may be seductive, the argument form suggested in Section 2.5.4 is just not a strong inductive argument (see, e.g., [Baird, 1984]). As we noted there, the general form is: (P1) if p then q has low probability; (P2) in fact, q; (C) therefore, p has low probability (or, more conservatively, q is evidence against p). But despite its similarity with modus tollens (which may be the source of its seductiveness [Seidenfeld, 1979]), this inductive version of modus tollens is a bad argument — not merely deductively invalid, but inductively weak, i.e., not to be endorsed at this level of generality. For consider the following example: (P1) If John is a soccer player, there is a low probability that he is a goal keeper; (P2)
in fact, John is a goal keeper; (C) therefore, John is likely not a soccer player (or, his being so is evidence against his being a soccer player). To the contrary, (P2) is quite conclusive evidence in favor of John’s being a soccer player. (Other examples along similar lines are easily constructed.) This point illustrates the general trouble one can get into when drawing analogies between deductive and inductive reasoning. Similar cases are much discussed. For example, Salmon [1984] and others have suggested that explanations can lower the probability of the event or facts to be explained. That is, after having come to know the explanation for fact F , one might find that one’s assessment of the probability of F is lower than it was before. (Of course, there are delicate issues here regarding what one means by the ‘probability of F ’, given that F is presumed to be a fact.) Presumably the analogous thing cannot happen for non-probabilistic explanations — if a proposed explanation entails the denial of F , then presumably the explanation fails. What goes for explanation goes for evidence too: the fact that a null hypothesis confers low probability on some piece of evidence ought not necessarily be taken as evidence against the hypothesis. The example above illustrates why.
4.2 Disagreement with Bayesian Updating
The deviation between the results of a significance test and Bayesian updating (see Section 3) also constitutes an objection against significance testing, because a significance test may reject an hypothesis (in light of some sampling evidence, for example) even if that hypothesis continues, after Bayesian updating, to have high probability. Such will often be the case when the prior probability of the hypothesis is quite high. Consider, for example, a medical test for some condition, C, whose prevalence in the relevant population is very low, say 1 in 100,000. Our hypothesis is that a given patient does not have C. Suppose that the test has sensitivity 90% and specificity 97%. (That is, Pr(+|C) = 0.90 and Pr(−|¬C) = 0.97, where ± indicates a positive/negative test result.) A positive test for C is therefore statistically significant at the 5% level — the ‘null hypothesis’ (that the patient does not have C), together with this information about the test, predicts a probability of less than 0.05 for the positive test result. The logic of significance testing would therefore suggest that we reject the null hypothesis (i.e., assert that the patient has the condition) in the case of a positive test.⁵ However, Bayesian updating (which in this case clearly provides the correct analysis) yields a posterior probability for the null hypothesis of nearly 1. The cause of the discrepancy is clear: significance testing does not take into account the prior probability of the null hypothesis. It treats the most fantastical, unlikely, null hypothesis on a par with well-established theories. The example makes it clear that this issue is not just an esoteric roadblock to the full exposition of a logic of discovery. Real decisions must be made on the basis of such inductive inferences (see Section 4.3), and adopting the ‘wrong’ methodology can be literally fatal. (See [McCloskey and Ziliak, 2008a] — the subtitle of that book, “How the Standard Error Costs Us Jobs, Justice, and Lives”, raises this issue pointedly.) Indeed, one of the sharpest critiques of the recent movement towards ‘evidence-based’ medicine (which relies heavily on statistical analysis, including significance testing) is precisely that the statistical methods available to us in this context (notably, including significance testing) are simply not up to the task of rationally guiding our actions. This fact has even been expressed in Popperian terms: the method of significance testing, when properly pursued, is, as we note below, a form of Popperian falsificationism, which is not up to the task of grounding medical practice. (See [Daly, 2005] for an account of evidence-based medicine, and [Shahar, 2003] for the point about Popperian methodology.)

Footnote 5: To be clear: the point here is about the logic of significance testing. We are not claiming that anyone would in fact make this mistake.
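The arithmetic of the medical-test example above is easily verified in a minimal sketch (Python):

```python
# Prevalence 1 in 100,000, sensitivity 0.90, specificity 0.97, and a
# positive test result, as in the example above.
prevalence = 1e-5
sensitivity = 0.90        # Pr(+ | C)
specificity = 0.97        # Pr(- | not-C)

# The null hypothesis (the patient does not have C) assigns the positive
# result probability 1 - specificity = 0.03 < 0.05, so a significance test
# at the 5% level rejects it.
print(1 - specificity)

# Bayesian updating instead yields the posterior probability of not-C:
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior_not_C = (1 - specificity) * (1 - prevalence) / p_positive
print(posterior_not_C)    # about 0.9997: the null hypothesis remains nearly certain
```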
4.3 Grounding Practical Reason
As we just suggested, significance testing properly pursued is just a particularly simple form of Popperian falsificationism, in which we reject a null hypothesis in light of ‘evidence’ against it (if even that much inference is licensed — see Sections 4.2 and 4.1), but say nothing about the null hypothesis otherwise. (Popper’s considered position may be somewhat more complex. The literature here is vast. See [Popper, 1959; 1963], and the exposition in [Miller, 1994].) Many researchers have commented on the emptiness of a science that is restricted to a simple Popperian methodology: plain falsificationism cannot account for the fact that a primary aim of science, and necessarily so for applied science, is to make positive assertions about the world, assertions on which one can — rationally! — act. (Again, the literature is vast, and the debate perhaps intractable. See, for example, [Jeffrey, 1975; Salmon, 1981; Rothbart, 2007] for the criticism mentioned here, and on the intractability of the debate, [Swann, 1988].) Despite Popper’s understandable epistemological scruples about inductive inference, in the end, the practical value of science must rest on its ability to ground action in a rational way, and this requires some reasonable logic of induction that can provide such grounding. Significance testing cannot do so, when it is restricted to its logically permitted role (if even this much is permitted) of licensing the rejection but not the acceptance of scientific hypotheses. (Recall from Section 2.6.1 that exactly this issue may be at the heart of the dispute between Fisher and Neyman-Pearson.) A further potential problem (with the application of significance testing) is that even if one were to ignore the logical issues mentioned above — indeed, even if one were to set aside Popperian scruples about induction — it is far from clear that the result would be a science of practical value. For example, it is widely recognized that statistical significance and economic significance are independent of one another. (How widely recognized this point is in the current actual practice of economics is a hotly debated issue. See, e.g., [McCloskey, 1998], the highly
critical article by [Hoover and Siegler, 2008a], the reply by [McCloskey and Ziliak, 2008b], and the reply to the reply by [Hoover and Siegler, 2008b].) For example, a statistically significant difference in the effectiveness of two different investment strategies may be economically insignificant. The point here is that statistical significance may by itself be completely insensitive to the relevant units of analysis (e.g., [Snyder and Lawson, 1993; Ziliak and McCloskey, 2004; Kirk, 2007]). It is constrained not by practically relevant numerical measures (costs, levels of effectiveness, and so on), but by such factors as sample sizes and standard deviations. For example, a very large (in practical terms) difference between two populations may be ‘undetectable’ by significance testing if our samples are small. Similarly, a very small (in practical terms) difference may show up as a strongly significant difference (statistically) if the samples are very large. These points have practical import. For example, in Castaneda v. Partida, 430 U.S. 482 (1977), the Supreme Court established a guideline of “2 or 3 standard deviations” (p-values of 0.05 to 0.01) for determining whether an employer has engaged in employment discrimination. (For a discussion of statistical issues in employment litigation, see, e.g., [Zink and Gutman, 2005; Siskin and Trippi, 2005, 136–143].) But this standard clearly has its limitations (as the Court has acknowledged in more recent cases). For example, an employer with only 10 employees may be discriminating, but the size of the population is sufficiently small that only the most egregious cases would be detectable by a significance test. On the other hand, a careful statistical significance test could detect extremely small salary differences (say, differences of only a few dollars per year), or differences (say, 1 in 10,000) in rates of layoffs, among subgroups of employees of a very large employer. Such differences may quite reasonably be judged statistically, but not economically, or even legally, significant. (Would an employer really ‘discriminate’ by paying certain groups only a few dollars less per year, or laying off just 1 in 10,000 ‘extra’ members of a given group, or is some other explanation more likely?) Examples from medicine abound as well. Suppose that in a clinical trial, a cholesterol-lowering drug was found to lower cholesterol by 0.05 mg/dL. In a well-designed trial, this result may be strongly statistically significant, but it is doubtful that the drug’s effect should be taken to be medically significant. Those who oppose significance testing on the grounds that statistical and practical significance differ are apt to advocate for a scientific focus on the magnitude, rather than the bare existence, of differences amongst populations, or differences between samples and the population from which they are taken. This sort of scientific activity requires formal methods other than significance testing. (See, e.g., [Thompson, 1998; Miller and van der Meulen Rodgers, 2008].)
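The sample-size point above can be made vivid in a minimal sketch (Python, SciPy assumed; the standard deviation and effect sizes are illustrative):

```python
# A practically trivial difference becomes statistically significant in a
# large sample, while a practically large difference in a small sample
# does not.
import math
from scipy.stats import norm

def two_sided_p(effect, sd, n):
    """p-value for an observed mean difference, given per-unit sd and sample size n."""
    return 2 * norm.sf(abs(effect) / (sd / math.sqrt(n)))

print(two_sided_p(0.05, 1.0, 10_000))   # tiny effect, huge sample: about 6e-7
print(two_sided_p(0.8, 1.0, 4))         # large effect, small sample: about 0.11
```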
4.4 The Problem of Auxiliary Hypotheses
As has been well established within philosophy of science, the logical gap between simple hypotheses and empirical fact is vast — various auxiliary hypotheses are
always in play. This point goes back at least to Duhem [1914/1954], and was made famous in Anglo-American philosophical circles in a somewhat stronger version by Quine [1951]. In a different guise, the point has been argued at length by Cartwright [1983]. The general idea is easy enough to establish: if one’s theory predicts, for example, that the temperature of a given sample of coffee should be 135 °F by 4:00pm, then any test of that theory is necessarily going to involve several auxiliary assumptions, such as that the temperature was indeed measured at 4:00pm (the clock was working), the thermometer is accurate, and so on, including many other, more complex and subtle, theoretical, experimental, technological, logical, and epistemological hypotheses, many of which may go unrecognized. Whenever a piece of empirical evidence threatens to overturn some hypothesis, one must keep in mind that these sorts of auxiliary hypotheses may be to blame instead. Again, the point is not of merely esoteric theoretical interest. To take the most obvious example: the application of significance testing to survey sampling typically requires an assumption that the sample was ‘random’ with respect to the variable of interest. Suppose, for example, that we wished to determine whether a given intervention affects health in a statistically significant way. The usual procedure is via a double-blind test, involving a random sample of subjects, some of whom are given the intervention, others not. (See [Cartwright, 2007] and references therein for an extended critique of this sort of testing, however.) If the samples are not random, then the results require further scrutiny. There are, of course, many famous examples of ‘random’ samples that were not so [Wheeler, 1976]. The problem of obtaining a random sample is especially intractable in situations where a decent theory of the matter under investigation is unavailable. In that case, one simply does not know which variables are relevant to the variable of interest — which other properties of the units being selected may influence the variable of interest — and thus one does not know how to control for them to produce a truly random sample. It is also worth bearing in mind that in many testing situations, samples (for example, those in need of treatment for some condition) are convenience samples (those willing and available to take part in the treatment), whose relationship to a larger ‘population’ (whatever it may be) with respect to the variables of interest may be quite unclear. Of course, there are plenty of other auxiliary hypotheses that are generally required in an application of significance testing. Sometimes these hypotheses can themselves be verified to reasonable precision, but they may also be simply articles of faith.
4.5
Which Level of Statistical Significance Matters?
Suppose that, at least for some specified purpose and experimental design, we have agreed that statistical significance testing is the appropriate way to determine whether some evidence supports or disconfirms a given null hypothesis. We must still decide what level of significance we require. It is common to choose either the
1% or the 5% level. It is even more common to suppose that this choice matters, but in fact often the difference is itself statistically insignificant. The point is easiest to see in the context of comparing studies. (We have modified the following example from Gelman and Stern [2006], who discuss this point, and the appearance of this mistake in the literature, in further detail.)

Consider two studies of the effectiveness of a pest control on crop harvest. In the first, a difference is found between the control group (no pest control applied) and the treatment group (pest control applied) of 25 pounds per acre of harvest. (So the statistic in this case is a difference of sample means.) The standard error is 10 pounds. (The standard error associated with a random sample is the standard deviation of the distribution of all sample means generated by random samples of the same size, from the same population.) This result is thus statistically significant at around the 1% level. In the second study, a difference of 10 pounds per acre is found, with the same standard error of 10 pounds. This result is not statistically significant.

What should we conclude from these results? Is one of these studies better than the other? And what if the two tests were for two different pest controls? Should we conclude that the first is effective, but the second not (or not shown to be)? Any difference in one's judgment of the two tests would presumably rest on the difference in statistical significance achieved in them. However, that difference itself is not statistically significant. The expected difference between the two studies (i.e., the expected value of the difference between the means generated by the two different samples) is 15, with a standard error of roughly 14. In other words, the difference between these two studies is itself not statistically significant. Hence it is unclear what statistical evidence would ground the preference for one over the other.

Furthermore, consider a third study, with a larger sample size. Suppose it reveals an effect of 2.5 pounds, with a standard error of 1 pound. The statistical significance of this study is the same as that of the first study, but the difference between them is also statistically significant. Moreover, if we were to consider just the rejection of the null hypothesis that there is no effect, we might be inclined to say that the third study replicates the first, when in fact the magnitude of the effect found in the third study is quite different from that found in the first study — and this difference is statistically significant! The lesson here is that comparing studies by their statistical significance is at best a dicey business. (Note also that the issue of the size of the effect has resurfaced — see Section 4.3.)
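The arithmetic behind the example is worth making explicit. Assuming, as is standard for independent estimates, that the standard error of their difference is the square root of the sum of the squared standard errors, a few lines of Python reproduce the comparisons:

```python
import math

def z_and_p(diff, se):
    """z-score and two-sided p-value (normal approximation)."""
    z = diff / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))

print(z_and_p(25, 10))   # study 1 alone: z = 2.5, p ~ 0.012 (significant, ~1% level)
print(z_and_p(10, 10))   # study 2 alone: z = 1.0, p ~ 0.32  (not significant)

# Study 1 vs study 2: standard error of the difference of the two estimates.
se_12 = math.sqrt(10**2 + 10**2)   # ~14.1
print(z_and_p(25 - 10, se_12))     # z ~ 1.06, p ~ 0.29: not significant

# Study 1 vs study 3 (effect 2.5, standard error 1).
se_13 = math.sqrt(10**2 + 1**2)    # ~10.05
print(z_and_p(25 - 2.5, se_13))    # z ~ 2.24, p ~ 0.025: significant
```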
5
CONCLUSION
Certainly there are defenders of significance testing (e.g., [Abelson, 1995; 1997; Tukey, 1991]), and although we have focused here on the myriad ways in which significance tests can be misleading, or can be misused or even abused, it would be wrong to say that a well-designed significance test tells one nothing. Exactly how much it does tell one is a matter for further discussion.
We will not engage in that discussion here in detail, except to make two observations. The first is that there is an important similarity between this issue and the issue of the strength of hypothetico-deductive reasoning in general. Of course, views on hypothetico-deduction in science span the range from utter rejection (e.g., Popper) to claims that it is the central form of reasoning in science (e.g., [Lawson, 2000]). Perhaps more interesting for our purposes are those who take some middle path, cautiously endorsing hypothetico-deduction under appropriate circumstances (e.g., [McMullin, 1984]). Such authors must address the question of the conditions under which an inference to the (likely) truth of a hypothesis is licensed by the observation of consequences predicted by that hypothesis. Such accounts are too subtle and lengthy to be pursued here. (We refer the reader to [McMullin, 1984] for an example.) The similarity to significance testing should be clear — the interpretive and logical problems that we have discussed here with significance testing do not constitute a rejection of the entire enterprise, but present a challenge to specify (either generally, or in the context of a specific case) the conditions under which inferences based on significance testing are inductively strong. This is an open project.

This similarity between hypothetico-deductive reasoning and significance testing leads us to our second observation: that the plausibility of such inferences (in both cases) may be greatly enhanced by the theoretical or experimental context in which the inference is made. Significance testing runs into trouble when researchers collect data without much of a clue about the underlying mechanisms that give rise to the data. On the other hand, in situations where a great deal of knowledge about the possible causes of an effect is in place, it is far easier to avoid a mistaken inference or interpretation. However, it must be emphasized that in such cases, a great deal more than statistical significance is at work, and it is fair to wonder just how much inferential work is being done by significance (or lack of it) alone. The discussion above makes it clear, in any case, that bare statistical significance (or lack of it) is a very shaky basis on which to draw inferences, despite the siren call of those inferences.

ACKNOWLEDGEMENTS

Thanks to Prasanta Bandyopadhyay and an anonymous referee for very helpful comments on an earlier draft, which has been much improved because of their remarks.

BIBLIOGRAPHY

[Abelson, 1995] R. P. Abelson. Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum, 1995.
[Abelson, 1997] R. P. Abelson. On the Surprising Longevity of Flogged Horses: Why There Is a Case for the Significance Test, Psychological Science 8:12–15, 1997.
[Altman, 2004] M. Altman. Statistical Significance, Path Dependency, and the Culture of Journal Publication, Journal of Socio-Economics 33:651–663, 2004.
[Azar, 1997] B. Azar. APA Task Force Urges a Harder Look at Data, APA Monitor 28:26, 1997.
[Bakan, 1966] D. Bakan. The Test of Significance in Psychological Research, Psychological Bulletin 66:423–437, 1966.
[Bakan, 1984] D. Bakan. Tests of Significance Violate the Rule of Implication, in P. D. Asquith and P. Kitcher (eds.), PSA 1984 1:81–92, 1984.
[Baird, 1983a] D. Baird. Conceptions of Scientific Law and Progress in Science, in N. Rescher (ed.), The Limits of Lawfulness. Lanham, MD: University Press of America, 33–41, 1983.
[Baird, 1983b] D. Baird. The Fisher/Pearson Chi Squared Controversy: A Turning Point for Inductive Inference, The British Journal for the Philosophy of Science 34:105–118, 1983.
[Baird, 1987] D. Baird. Significance Tests: History and Logic, in N. L. Johnson and S. Kotz (eds.), Encyclopedia of Statistical Sciences, vol. 8. New York: John Wiley & Sons, 466–471, 1987.
[Berkson, 1938] J. Berkson. Some Difficulties of Interpretation Encountered in the Application of the Chi-Square Test, Journal of the American Statistical Association 33:526–542, 1938.
[Bland and Bland, 1994] J. M. Bland and D. G. Bland. Statistics Notes: One and Two Sided Tests of Significance, British Medical Journal 309:248, 1994.
[Box, 1953] G. E. P. Box. Non-normality and Tests on Variances, Biometrika 40:318–335, 1953.
[Box and Draper, 1987] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. Oxford: John Wiley & Sons, 1987.
[Cartwright, 1983] N. Cartwright. How the Laws of Physics Lie. Oxford: The Clarendon Press, 1983.
[Cartwright, 2007] N. Cartwright. Are RCTs the Gold Standard?, BioSocieties 2:11–20, 2007.
[Carver, 1978] R. P. Carver. The Case Against Statistical Significance Testing, Harvard Educational Review 48:378–399, 1978.
[Cohen, 1994] J. Cohen. The Earth is Round (p < .05), American Psychologist 49:997–1003, 1994.
[Daly, 2005] J. Daly. Evidence-Based Medicine and the Search for a Science of Clinical Care. California/Milbank Books on Health and the Public, vol. 12. Berkeley, CA: University of California Press, 2005.
[Duhem, 1914/1954] P. Duhem. La Théorie Physique: Son Objet et Sa Structure. Paris: Chevalier et Rivière, 1914. Translated by Phillip Wiener as The Aim and Structure of Physical Theory. Princeton: Princeton University Press, 1954.
[Eells, 1991] E. Eells. Probabilistic Causality. Cambridge: Cambridge University Press, 1991.
[Fisher, 1915] R. A. Fisher. Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population, Biometrika 10:507–521, 1915.
[Fisher, 1922] R. A. Fisher. On the Mathematical Foundations of Theoretical Statistics, Philosophical Transactions of the Royal Society, London, Series A 222:309–368, 1922.
[Fisher, 1936] R. A. Fisher. Has Mendel's Work Been Rediscovered?, Annals of Science 1:115–137, 1936.
[Fisher, 1947] R. A. Fisher. The Design of Experiments, Fourth Edition. New York: Hafner Press, 1947.
[Folger, 1989] R. Folger. Significance Tests and the Duplicity of Binary Decisions, Psychological Bulletin 106:155–160, 1989.
[Franklin et al., 2008] A. Franklin, A. W. F. Edwards, and D. J. Fairbanks. Ending the Mendel–Fisher Controversy. Pittsburgh: University of Pittsburgh Press, 2008.
[Gelman and Stern, 2006] A. Gelman and H. Stern. The Difference Between 'Significant' and 'Not Significant' Is Not Itself Statistically Significant, American Statistician 60:328–331, 2006.
[Gigerenzer, 1993] G. Gigerenzer. The Superego, the Ego, and the Id in Statistical Reasoning, in G. Keren and C. Lewis (eds.), A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Hillsdale, NJ: Erlbaum, 311–339, 1993.
[Gigerenzer et al., 1989] G. Gigerenzer, Z. Swijtink, and T. Porter. The Empire of Chance. Cambridge: Cambridge University Press, 1989.
[Gliner et al., 2002] J. A. Gliner, N. L. Leech, and G. A. Morgan. Problems with Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?, The Journal of Experimental Education 71:83–89, 2002.
[Gossett, 1908] W. Gossett. The Probable Error of a Mean, Biometrika 6:1–25, 1908.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference. New York: Cambridge University Press, 1965.
[Hacking, 1990] I. Hacking. The Taming of Chance. New York: Cambridge University Press, 1990.
[Hoover and Siegler, 2008a] K. D. Hoover and M. V. Siegler. Sound and Fury: McCloskey and Significance Testing in Economics, Journal of Economic Methodology 15:1–37, 2008.
[Hoover and Siegler, 2008b] K. D. Hoover and M. V. Siegler. The Rhetoric of 'Signifying Nothing': A Rejoinder to Ziliak and McCloskey, Journal of Economic Methodology 15:57–68, 2008.
[Humphreys, 1989] P. Humphreys. The Chances of Explanation: Causal Explanations in the Social, Medical, and Physical Sciences. Princeton: Princeton University Press, 1989.
[Jeffrey, 1975] R. C. Jeffrey. Probability and Falsification: Critique of the Popper Program, Synthese 30:95–117, 1975.
[Kendall, 1942] M. G. Kendall. On the Future of Statistics, Journal of the Royal Statistical Society 105:69–80, 1942.
[Kirk, 2007] R. E. Kirk. Effect Magnitude: A Different Focus, Journal of Statistical Planning and Inference 137:1634–1646, 2007.
[Krantz, 1999] D. H. Krantz. The Null Hypothesis Testing Controversy in Psychology, Journal of the American Statistical Association 94:1372–1381, 1999.
[Kyburg, 1974] H. E. Kyburg. The Logical Foundations of Statistical Inference. Boston: D. Reidel, 1974.
[Lawson, 2000] A. Lawson. The Generality of the Hypothetico-Deductive Method: Making Scientific Thinking Explicit, American Biology Teacher 62:482–495, 2000.
[Lehman, 1993] E. L. Lehman. The Fisher, Neyman–Pearson Theories of Testing Hypotheses: One Theory or Two?, Journal of the American Statistical Association 88:1242–1249, 1993.
[Lehman, 1999] E. L. Lehman. 'Student' and Small-Sample Theory, Statistical Science 14:418–426, 1999.
[Lenhard, 2006] J. Lenhard. Models and Statistical Inference: The Controversy between Fisher and Neyman–Pearson, The British Journal for the Philosophy of Science 57:69–91, 2006.
[Mayo, 1992] D. G. Mayo. Did Pearson Reject the Neyman–Pearson Philosophy of Statistics?, Synthese 90:233–262, 1992.
[McCloskey, 1998] D. N. McCloskey. The Rhetoric of Economics. Madison, WI: University of Wisconsin Press, 1998.
[McCloskey and Ziliak, 2008a] D. N. McCloskey and S. Ziliak. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press, 2008.
[McCloskey and Ziliak, 2008b] D. N. McCloskey and S. Ziliak. Signifying Nothing: Reply to Hoover and Siegler, Journal of Economic Methodology 15:39–55, 2008.
[McMullin, 1984] E. McMullin. A Case for Scientific Realism, in Jarrett Leplin (ed.), Scientific Realism. Berkeley, CA: University of California Press, 8–40, 1984.
[Meehl, 1967] P. E. Meehl. Theory Testing in Psychology and in Physics: A Methodological Paradox, Philosophy of Science 34:103–115, 1967.
[Meehl, 1990] P. E. Meehl. Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It, Psychological Inquiry 1:108–141, 1990.
[Mendel, 1909/1865] G. Mendel. Versuche über Pflanzen-Hybriden, Verhandlungen des naturforschenden Vereines in Brünn 4:3–47, 1865. Reprinted and translated as Experiments in Plant-Hybridization in Bateson (1909), Mendel's Principles of Heredity, 318–361, 1909.
[Miller, 1994] D. Miller. Critical Rationalism: A Restatement and Defence. Chicago: Open Court, 1994.
[Miller and van der Meulen Rodgers, 2008] J. E. Miller and Y. van der Meulen Rodgers. Economic Importance and Statistical Significance: Guidelines for Communicating Empirical Research, Feminist Economics 14:117–149, 2008.
[Mohr, 1990] L. B. Mohr. Understanding Significance Testing. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-073. Newbury Park, CA: Sage, 1990.
[Neyman, 1938] J. Neyman. L'Estimation Statistique Traitée Comme un Problème Classique de Probabilité, Actualités Scientifiques et Industrielles 739:25–57, 1938.
[Neyman, 1955] J. Neyman. The Problem of Inductive Inference, Communications in Pure and Applied Mathematics 8:13–46, 1955.
[Neyman and Pearson, 1933] J. Neyman and E. S. Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses, Philosophical Transactions of the Royal Society of London, Series A 231:289–337, 1933.
[Nickerson, 2000] R. S. Nickerson. Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy, Psychological Methods 5:241–301, 2000.
[Pearl, 2000] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press, 2000.
[Pearson, 1900] K. Pearson. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it Can be Reasonably Supposed to Have Arisen from Random Sampling, reprinted in Pearson (1948), 1900.
[Pearson, 1930] K. Pearson. The Life, Letters, and Labours of Francis Galton, vol. 3a. Cambridge: Cambridge University Press, 1930.
[Pearson, 1948] K. Pearson. Karl Pearson's Early Statistical Papers. E. S. Pearson (ed.). Cambridge: Cambridge University Press, 1948.
[Pearson, 1955] E. S. Pearson. Statistical Concepts in Their Relation to Reality, Journal of the Royal Statistical Society, Series B 17:204–207, 1955.
[Pearson, 1962] E. S. Pearson. Some Thoughts on Statistical Inference, Annals of Mathematical Statistics 33:394–403, 1962.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery. London: Hutchinson, 1959.
[Popper, 1963] K. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge. London: Routledge, 1963.
[Pyzdek, 2001] T. Pyzdek. Non-Normal Distributions in the Real World, Journal of the RAC, Second Quarter 2001:10–17, 2001.
[Quine, 1951] W. V. O. Quine. Two Dogmas of Empiricism, The Philosophical Review 60:20–53, 1951.
[Rothbart, 2007] D. Rothbart. Popper Against Inductivism, Dialectica 34:121–128, 2007.
[Rozeboom, 1997] W. W. Rozeboom. Good Science is Abductive, Not Hypothetico-Deductive, in L. L. Harlow, S. A. Mulaik, and J. H. Steiger (eds.), What If There Were No Significance Tests?. Hillsdale, NJ: Erlbaum, 335–391, 1997.
[Salmon, 1981] W. Salmon. Rational Prediction, British Journal for the Philosophy of Science 32:115–125, 1981.
[Salmon, 1984] W. Salmon. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press, 1984.
[Seidenfeld, 1979] T. Seidenfeld. Philosophical Problems of Statistical Inference. Boston: D. Reidel, 1979.
[Seidenfeld, 2008] T. Seidenfeld. P's in a Pod: Some Recipes for Cooking Mendel's Data, in Franklin et al. (2008), 215–257, 2008.
[Shahar, 2003] E. Shahar. A Popperian Perspective of the Term 'Evidence-Based Medicine', Journal of Evaluation in Clinical Practice 3:109–116, 2003.
[Shaver, 1993] J. P. Shaver. What Statistical Significance Testing Is, and What It Is Not, Journal of Experimental Education 61:293–316, 1993.
[Siskin and Trippi, 2005] B. R. Siskin and J. Trippi. Employment Discrimination Litigation: Behavioral, Quantitative, and Legal Perspectives, in Frank J. Landy and Eduardo Salas (eds.), Employment Discrimination Litigation: Behavioral, Quantitative, and Legal Perspectives. San Francisco, CA: Wiley, 133–166, 2005.
[Snyder and Lawson, 1993] P. Snyder and S. Lawson. Evaluating Results Using Corrected and Uncorrected Effect Size Estimates, Journal of Experimental Education 61:293–316, 1993.
[Spirtes et al., 2000] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge, MA: M.I.T. Press, 2000.
[Student, 1908] Student [William S. Gosset]. The Probable Error of a Mean, Biometrika 6:1–25, 1908.
[Swann, 1988] A. J. Swann. Popper on Induction, British Journal for the Philosophy of Science, 367–373, 1988.
[Thompson, 1993] B. Thompson. The Use of Statistical Significance Tests in Research: Bootstrap and Other Alternatives, Journal of Experimental Education 61:361–377, 1993.
[Thompson, 1998] B. Thompson. Statistical Significance and Effect Size Reporting: Portrait of a Possible Future, Research in the Schools 5:33–38, 1998.
[Tukey, 1991] J. W. Tukey. The Philosophy of Multiple Comparisons, Statistical Science 6:100–116, 1991.
[Wheeler, 1976] M. Wheeler. Lies, Damn Lies and Statistics: The Manipulation of Public Opinion in America. New York: Dell Publishing, 1976.
[Yule and Greenwood, 1915] G. U. Yule and M. Greenwood. The Statistics of Anti-Typhoid and Anti-Cholera Inoculations and the Interpretation of Such Statistics in General, Royal Society of Medicine Proceedings, Section of Epidemiology and State Medicine 8(II):113–194, 1915.
[Ziliak and McCloskey, 2004] S. Ziliak and D. N. McCloskey. Size Matters: The Standard Error of Regressions in the American Economic Review, Econ Journal Watch 1:331–358, 2004.
[Ziliak and McCloskey, 2004] S. Ziliak and D. N. McCloskey. Significance Redux, Journal of Socio-Economics 33:665–675, 2004.
[Zink and Gutman, 2005] D. L. Zink and A. Gutman. Statistical Trends in Private Sector Employment Discrimination Suits, in Frank J. Landy and Eduardo Salas (eds.), Employment Discrimination Litigation: Behavioral, Quantitative, and Legal Perspectives. San Francisco, CA: Wiley, 101–131, 2005.
Bayesian Paradigm
Subjective Bayesianism
Objective Bayesianism
Confirmation Theory and Challenges to it
Bayesianism as a Form of "Logic"
THE BAYESIAN DECISION-THEORETIC APPROACH TO STATISTICS

Paul Weirich

Statistical inferences rely on probabilities. Probabilities also guide decisions. What implications does probabilities' role in decisions have for their role in statistical inferences? Bayesian statistics employs probabilities that are relative to evidence. These same probabilities ground decisions. Consequently, Bayesian statistics studies decisions to illuminate the probabilities it uses to analyze statistical inferences.

Section 1 explains the Bayesian decision-theoretic approach to statistics. Section 2 describes Bayesian probabilities' function in representations of decision problems. Section 3 considers how to use the relationship between Bayesian probabilities and decisions to illuminate these probabilities. It takes them to be rational degrees of belief and does not define them in terms of preferences. Section 4 explores the properties of Bayesian probabilities. Its main conclusion is that these probabilities respond only to evidence, and not to any nonevidential goal. Section 5 treats Bayesian conditionalization and presents strategies for meeting objections. Section 6 compares Bayesian and classical statistics. It defends Bayesian statistics against the charge of inordinate subjectivism.

1
BAYESIANISM IN INFERENCE AND DECISION
The term Bayesianism has multiple meanings. This should not be surprising. Ambiguity is a common feature of widely used technical terms. In one familiar sense Bayesianism analyzes statistical inferences and decisions made using probabilities. This section describes the key components of Bayesianism understood this way.
1.1
Bayesian Methods
Bayesianism advances conditionalization as a rule of probabilistic inference and maximization of expected utility as a decision rule. Conditionalization uses the probability of a hypothesis conditional on a possible bit of evidence to update the hypothesis's probability if that evidence arrives. A decision's expected utility is a probability-weighted average of the utilities of the decision's possible outcomes. A decision maximizes expected utility if and only if no alternative decision has greater expected utility. Later sections explain these tenets of Bayesianism.
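In symbols, the two tenets may be stated as follows (a standard formulation consistent with the glosses just given, where $P_{\text{new}}$ is the probability function after learning evidence $E$, the states $s_1, \dots, s_n$ are mutually exclusive and jointly exhaustive, and $o(d, s_i)$ is the outcome of decision $d$ in state $s_i$):

$$P_{\text{new}}(H) = P(H \mid E), \qquad EU(d) = \sum_{i=1}^{n} P(s_i)\, U(o(d, s_i)).$$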
Bayesianism derives its name from Bayes's theorem, which shows how to compute the probability of a hypothesis given some evidence. A simple version of the theorem states that if H is a hypothesis, E is evidence, P is a probability function, and P(X|Y) stands for the probability of X given Y, then P(H|E) = P(E|H)P(H)/P(E). Everyone accepts Bayes's theorem as a component of probability theory, but classical statisticians do not accept its use in statistical reasoning. In the course of statistical reasoning, classical statistics, unlike Bayesian statistics, does not compute the posterior probability of a hypothesis H after acquiring evidence E, a quantity equal to P(H|E) according to Bayesian conditionalization. The posterior probability's computation using Bayes's theorem requires that the prior probability of the hypothesis, P(H), and the prior probability of the evidence, P(E), are both meaningful and have well-defined values. Classical statistics holds that these prior probabilities are generally not available to guide statistical inference. In contrast, Bayesians hold that these probabilities, properly construed, are available. Hence, the controversy concerns applications of Bayes's theorem and the interpretation of the probabilities it regulates rather than the theorem itself.

A statistical test of a hypothesis collects statistical data to evaluate the hypothesis. The Bayesian decision-theoretic approach to statistics uses a statistical test along with prior information to evaluate a hypothesis. Bayesian methods generate the hypothesis's probability at the conclusion of the test. First, they use judgment to assign prior probabilities to the hypothesis and to the test's possible evidential outcomes. Then, in light of the test's actual outcome E, they employ Bayes's theorem to update the probability of the hypothesis from its prior value, P(H), to a new value equal to P(H|E). A last step may be to ground a decision on the results of the statistical test. For example, a pharmaceutical company may use a statistical test to decide whether to market a new drug. The success of marketing the drug depends on the drug's effectiveness. So statistical tests providing information about the drug's effectiveness may direct the company's decision about marketing the drug.

Some statisticians have applied decision methods directly to statistical inference. Lindley [1971] presents an account of decision making that analyzes gathering information and applies Bayes's theorem to update probabilities. Lindley's approach includes assessment of the value of gathering new information for a decision. DeGroot [1970] treats selection of an experiment to perform as a decision problem. According to DeGroot, "A statistician attempts to acquire information about the value of some parameter. His reward is the amount of information, appropriately defined, that he obtains about this value through experimentation. He must select an experiment from some available class of experiments, but the information that he will obtain from any particular experiment is random" (p. 88). DeGroot maintains that a statistician should choose experiments that minimize the risk of being mistaken about the truth of a hypothesis. He derives this policy from Wald [1950].

Berger [1993] advocates Bayesian statistics, which employs prior probabilities,
because its methods incorporate information available prior to testing. Prior probabilities often settle whether one should believe a hypothesis after testing. Suppose, for example, that one tests the hypothesis that a musician can distinguish scores written by Haydn from scores written by Mozart. The musician is shown ten scores written by either Haydn or Mozart. He correctly identifies the composer of each. Next, suppose that one tests the hypothesis that a gambler can predict whether a coin toss yields Heads or Tails. A fair coin is tossed ten times, and the gambler correctly predicts the result of each toss. The two tests are structurally similar. However, using prior information, one may conclude that the musician is indeed an expert but that, despite his lucky streak, the gambler is a charlatan. Reasonable statistical inference should not ignore such prior information, and Bayesian statistics folds it smoothly into probabilistic reasoning.

Gregory [2006] promotes Bayesian statistics for physics and astronomy because, he claims, it combines deductive and inductive argumentation, and so is more powerful than classical statistics. He treats probability theory as a logic of science, following Jaynes [2003].1
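A back-of-the-envelope Bayesian treatment makes the contrast vivid. The priors below are invented purely for illustration: a substantial prior for genuine musical expertise, a minuscule prior for genuine precognition. Ten successes in a row then yield very different posteriors, even though the likelihoods are structurally identical.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """Bayes's theorem, with P(E) expanded by the law of total probability."""
    p_e = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)
    return p_data_given_h * prior_h / p_e

chance = 0.5 ** 10  # probability of 10 correct answers by luck alone

# Musician: expertise is antecedently plausible (illustrative prior 0.3).
print(posterior(0.3, 1.0, chance))    # ~0.998: believe the expert

# Gambler: precognition is antecedently incredible (illustrative prior 1e-7).
print(posterior(1e-7, 1.0, chance))   # ~0.0001: still, probably, a lucky charlatan
```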
1.2
Subjective Probability
The type of probability Bayesian statistics employs distinguishes it from classical statistics. Bayesians, in contrast with classical statisticians, employ probabilities that are relative to evidence. These probabilities are subjective because people have different bodies of evidence. Such informationally sensitive probabilities are also often called personal, evidential, or epistemic probabilities.

Bayesians differ about the constraints evidence imposes on a person's subjective probabilities, but most hold that it allows some latitude, so that two rational people with the same evidence may assign subjective probabilities differently. For example, two rational physicians may reach different diagnoses of a patient's ailment despite having the same data and medical training. However, Bayesians show that the mounting force of incoming evidence will often eventually trump personal epistemic proclivities. As two Bayesians who assign different initial probabilities to a hypothesis update their assignments in response to new relevant evidence, their updated probability assignments will usually converge to the same value, provided that the incoming evidence is ample.2

1 A hypothesis's likelihood on the data is the data's probability given the hypothesis. Classical statistics rejects the likelihood principle as well as Bayesian methods. Edwards [1992: 30] states the likelihood principle as follows: "Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data" (emphasis in the original).
2 Savage [1954/1972: Sec. 3.6] presents a theorem about convergence. It shows that evidence reduces the importance of initial probability assignments. He holds that his theorem supports convergence not only to agreement but also to truth. About a typical person, he claims (p. 50), "With the observation of an abundance of relevant data, the person is almost certain to become highly convinced of the truth, and . . . knows this to be the case." James Hawthorne's contribution to this volume, "Confirmation Theory," extends Savage's results.
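The convergence described in the main text (see also note 2) is easy to simulate. In the sketch below, with illustrative numbers only, two agents hold sharply different Beta priors over a coin's bias; after both update on the same long run of tosses, their posterior means nearly coincide.

```python
import random

random.seed(1)
true_bias = 0.7
tosses = [random.random() < true_bias for _ in range(1000)]
heads = sum(tosses)

# Beta(a, b) priors: conjugate updating adds the observed successes to a
# and the observed failures to b.
for a, b in [(1, 9), (9, 1)]:  # one skeptic about Heads, one enthusiast
    post_mean = (a + heads) / (a + b + len(tosses))
    print(post_mean)  # both posterior means land close to 0.7
```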
Subjective probabilities, being relative to evidence, differ from objective probabilities that are independent of evidence. Objective probabilities do not change as evidence changes. They depend on natural phenomena and not evidence about them.

To illustrate the difference between subjective and objective probabilities, consider the probability of obtaining a black ball during a random draw from an urn with 40 black balls and 60 red balls. The objective probability is 40%. If a person does not know the urn's contents, the person's subjective probability may have a different value. If he thinks that the urn has 50 black balls and 50 red balls, his subjective probability of drawing black may be 50%. For another example, consider a juror hearing testimony in court during a criminal case. She may favor the defendant's innocence after hearing the defense attorney's presentation. Then after hearing the prosecuting attorney's presentation, she may favor the defendant's guilt. The subjective probability of guilt varies as the evidence is presented. Nonetheless, during the presentation of evidence, the objective probability of guilt is constant. It is either 0 or 1 depending on whether the defendant in fact committed the crime.

A person's subjective probability assignment to a proposition represents the person's strength of belief that the proposition is true. It ranges from 0, which represents certainty that the proposition is false, to 1, which represents certainty that the proposition is true, and responds to evidence bearing on the proposition's truth.

Objective probabilities may change over time. The objective probability that a random draw from an urn yields a black ball changes over time if the urn has a hole through which its mixture of black and red balls spills. Also, an analogue of the principle of conditionalization holds for objective probabilities. It asserts that if an event E influences the objective probability of a hypothesis H, then, if E occurs and nothing else of relevance occurs, the new objective probability of H equals the previous conditional objective probability of H on E. Despite these similarities, subjective and objective probabilities have different grounds and consequently different basic features.

2
DECISION PROBLEMS
Granting the distinction between objective and subjective probabilities, Bayesianism requires a precise account of the subjective probabilities that it uses. Various accounts are on the market. This section presents the foundations of the most common account, which explicates subjective probabilities by describing their relation to preferences among options in decision problems. The section characterizes decision problems, their representations, and relations among components of their representations such as subjective probabilities.
2.1
Preferences
In a decision problem an agent must select an act from all the acts that she may perform. In standard cases, she has a preference ranking of her options. These
are possible acts or, more precisely, possible decisions to perform those acts. The agent's preference ranking of options guides her choice.

The usual representation of a decision problem lists an agent's options, possible outcomes of adopting those options, and states of the world that influence options' outcomes. The probabilities of the states yield the probabilities of an option's possible outcomes. The decision problem's representation exhibits the relation between preferences among options and the probabilities and utilities of options' possible outcomes. An option's position in an agent's preference ranking varies with the probabilities and utilities of the option's possible outcomes.

In typical decision problems the state of the world affects the outcome that an act produces. However, the agent often does not know the state of the world. Suppose that a set of mutually exclusive and jointly exhaustive possible states represents all features of the world that affect the various options' outcomes. For each possible state, an option has a specific outcome. Given a decision problem's representation in terms of options, states, and outcomes, standard decision principles apply. For example, in special cases a simple comparison of possible outcomes indicates a preference between a pair of options. If the outcome of one option is superior to the outcome of another option in each possible state, then the first option strictly dominates the second option. If an option's adoption does not influence the state that obtains, then a rational ideal agent prefers the first, dominating option to the second, dominated option.

There are many ways of elaborating a decision problem's representation in terms of options, states, and outcomes. A major issue concerns the nature of possible outcomes of an agent's options. These possible outcomes are objects of an agent's preferences and also objects of an agent's probability assignments. Among the candidates for outcomes are dated commodity-bundles, events, and propositions. A preference between a pair of dated commodity-bundles is, more precisely, a preference to acquire one dated commodity-bundle rather than the other. Because acquisition of a dated commodity-bundle is an event, taking events rather than dated commodity-bundles as objects of preference better represents the psychology of preference.

The laws of probability govern a Boolean algebra of events formed by applying to basic events operations that yield compound events. Events of many varieties may yield such an algebra. Selecting events with special features to be the objects of probability advances nonstructural, epistemic goals of probabilities. Because the subjective probability an agent assigns to an event depends on the agent's grasp of the event, subjective probabilities fit abstract event-types better than concrete events. Taking outcomes to be propositions, rather than event-types, is a further improvement, because it makes the objects of preference more fine-grained and responsive to an agent's information. Oedipus did not know that Jocasta was his mother. This explains why he preferred marrying Jocasta to not marrying her but did not prefer marrying his mother to not marrying her. The proposition that he marries Jocasta differs from the proposition that he marries his mother,
even if the corresponding event-types are the same. Taking preferences to compare propositions thus improves fidelity to psychology. Many Bayesians, such as Jeffrey [1965/1983], take propositions rather than event-types as the objects of preference, or they take event-types to be finely individuated by the propositions that describe them. Psychological considerations concerning an agent's grasp of propositions motivate making objects of preference even more fine-grained than propositions, but this essay puts aside that complication and treats outcomes and other objects of preferences and probability assignments, such as options and states, as propositions.

Using this account of options, states, and outcomes in a decision problem, let us explore the relation between preferences among options and probabilities of states. Suppose that a person at a particular time has preferences among bets. Those preferences may reveal the person's probability assignment. Suppose, for example, that he considers a bet that S and a bet that T, and the stakes for each bet are the same. If he prefers the bet that S to the bet that T, then that preference indicates that he assigns a higher probability to S than to T. In this way, by considering many possible gambles, one may infer a person's probabilities from his preferences among gambles. For example, suppose that a person is just as ready to bet that it will snow as that it will not snow. Because P(S or ∼S) = 1 and P(S) + P(∼S) = P(S or ∼S), if P(S) = P(∼S) then both probabilities equal 0.5. Given four mutually exclusive and jointly exhaustive equiprobable events, one may similarly infer that the probability of each is 0.25. Also, given eight mutually exclusive and jointly exhaustive equiprobable events, one may infer that the probability of each is 0.125. One may measure an arbitrary event's probability as finely as one pleases by comparing it to events on scales formed with disjunctions of equiprobable events.

The utility an agent assigns to a possible outcome of an option is a number that represents the possible outcome's subjective value to the agent. A gamble's expected utility depends on the probabilities and utilities of its possible outcomes. Winning the gamble has a probability and a utility. Losing the gamble also has a probability and a utility. The product of an outcome's probability and utility indicates the subjective value of the prospect of that outcome. The sum of these products for a gamble's possible outcomes is the gamble's expected utility.

In general, an option's expected utility depends on a set of mutually exclusive and jointly exhaustive possible states. For each state, one considers the outcome of the option in the state and forms the product of the outcome's probability and its utility. The option's expected utility is the sum of the products. Any set of states that are mutually exclusive and jointly exhaustive yields a corresponding set of possible outcomes. That set of possible outcomes in turn yields the option's expected utility. The option's expected utility is the same regardless of the set of mutually exclusive and jointly exhaustive states that one employs to assess it.

An agent's utility assignment depends on the agent's personal goals, and so is subjective. Because utility is sensitive to an agent's information as well as his personal goals, an option's utility equals its expected utility. This basic principle
of utility analysis connects the probabilities and utilities of an option's possible outcomes to the option's position in a rational agent's preference ranking of her options. A rational agent's preferences among options agree with the options' utilities and therefore with their expected utilities.
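As a concrete illustration of the computation just described, here is a sketch with an invented payoff table: each option's expected utility is summed over a common set of mutually exclusive, jointly exhaustive states, and a maximizing option is selected.

```python
# States with their probabilities (mutually exclusive, jointly exhaustive).
probs = {"rain": 0.3, "shine": 0.7}

# Utilities of each option's outcome in each state (illustrative numbers).
utilities = {
    "take umbrella": {"rain": 5, "shine": 4},
    "leave umbrella": {"rain": -2, "shine": 6},
}

def expected_utility(option):
    return sum(probs[s] * utilities[option][s] for s in probs)

for option in utilities:
    print(option, expected_utility(option))   # take: 4.3, leave: 3.6

print("maximizing option:", max(utilities, key=expected_utility))
```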
2.2
Representation Theorems
A decision problem’s presentation lists options, states, and outcomes. An agent has preferences among outcomes and options. She also assigns probabilities to states and utilities to outcomes and options. How are these preferences, probabilities, and utilities related? Representation theorems show that if a person’s preferences comply with certain plausible axioms for relations among preferences, then the person’s preferences entail a probability assignment to possible states of the world. Some preference axioms express requirements of rationality. For example, one axiom requires that preferences be transitive. Other axioms are structural. Compliance with them ensures a rich structure of preferences. For example, one structural axiom requires that, for any two options such that a person prefers one to the other, a third option has an intermediate preference rank. To illustrate, suppose that a person prefers betting that Heads turns up on a coin toss to betting that a six turns up on a roll of a die. Then for another proposition such as that drawing a card from a standard deck yields a diamond, she prefers betting that Heads turns up to betting that a diamond is drawn, and prefers betting that a diamond is drawn to betting that a six turns up. The normative axioms are principles of rationality for the preferences of an ideal agent. Meeting them is necessary for having preferences that follow expected utilities. The nonnormative, structural axioms are not principles of rationality. Meeting them is not necessary for having preferences that follow expected utilities. The structural axioms ensure that preferences among options are extensive enough to derive probabilities of states from those preferences assuming that the preferences follow expected utilities. A typical representation theorem shows that if a person’s preferences concerning gambles satisfy the preference axioms, then there is a unique assignment of probabilities to possible states such that a person prefers betting that one possible state obtains to betting that another possible state obtains just in case the expected utility of the first bet exceeds the expected utility of the second bet. Some very general representation theorems show that one may infer an agent’s probabilities and utilities simultaneously. They show that if an agent’s preferences concerning gambles satisfy certain axioms, then there is a unique assignment of probabilities and also a unique assignment of utilities (given a choice of scale) such that the agent prefers one gamble to another just in case the first’s expected utility exceeds the second’s expected utility. Savage [1954/1972] presents a famous representation theorem of this sort. His proof uses techniques that Ramsey [1931/1990] pioneered. Section 7, an appendix, presents the axioms of preference that Savage’s
representation theorem assumes. A general representation theorem does not justify the laws of probability and the expected utility principle. As explained, it shows only that given any preference ordering of options that satisfies the preference axioms, there is a unique assignment of numbers to possible states of the world complying with the laws of probability, and (given a choice of scale) a unique assignment of numbers to outcomes complying with the laws of utility, such that the expected utilities of options (as computed from these probability and utility assignments) accurately reflect the preference ordering of options. The expected utility principle, for example, requires that an option’s utility equal its expected utility. It is a normative principle. Its justification requires normative principles involving probabilities and utilities and not just a representation theorem’s normative axioms of preference.
2.3
Refinements
The standard formula for an option's expected utility assumes that the option's realization does not affect a state's probability. If an option's realization affects a state's probability, then the probability of the state given the option replaces the probability of the state. The probability of a state S given an option O is by definition a ratio of nonconditional probabilities: P(S|O) = P(S&O)/P(O). The ratio's being high does not establish that O's occurrence will probably bring about S. The option O may be a sign of the state S without having any causal influence on S. For example, getting ten Heads on ten tosses in a row may be a sign that a coin is biased for Heads. However, the string of Heads does not causally influence the result of the eleventh toss. Also, although the probability of measles given the characteristic rash is high, the rash does not cause measles. It is just a symptom of the disease. Causal decision theory, as expounded by Joyce [1999], for example, teaches that the conditional probabilities used to compute expected utilities should not be standard conditional probabilities, because those conditional probabilities do not distinguish causation from mere correlation. Expected utilities that direct action require a type of conditional probability that attends only to causation.3

The utilities of options, given by options' expected utilities, are the grounds of the decision principle of utility maximization. It requires adopting an option with a utility at least as great as the utility of any other option. However, if an option's realization furnishes evidence about which possible state of the world obtains, a decision principle should take account of this evidence, and standard decision theory needs revision to accommodate it. One such revision advances the principle of self-ratification — a principle that generalizes the principle of utility maximization. It requires adopting an option that maximizes utility assuming the option's realization.

To illustrate the difference between utility maximization and self-ratification, consider a person playing a game of matching pennies and trying for a match.

3 Freedman [1997: 120–121] also distinguishes two types of conditional probability. One type responds to correlation, and the other responds to causation.
He thinks that his opponent is equally likely to show Heads or to show Tails. His showing Heads thus maximizes utility. However, his opponent is a clever psychologist. If he shows Heads, then that act is evidence that his opponent has predicted it. Given the evidence the act provides, his opponent is likely to show Tails. Thus, his showing Heads does not maximize utility given its realization. It is not self-ratifying.

In some decision problems, no option is self-ratifying. Hence, a more general decision principle is required – one that recommends adopting an option that is self-supporting in a broader sense than being self-ratifying. An option is self-supporting in that broader sense if and only if, given its adoption, one lacks a sufficient reason for adopting an alternative. Suppose that no option is self-ratifying. Some option may nonetheless be self-supporting. Although given its adoption another option has higher utility, that reason for the other option is not sufficient if all options generate reasons for alternatives.

Decision theorists distinguish standards of evaluation for acts from rational procedures for selecting acts. Whether probability should always have a place in rational procedures for selecting acts is a matter of debate. In some cases the procedure of utility maximization is computationally complex, and its computational cost is not justified by its benefits. Probabilities should not guide action when the guidance they supply has an excessive cost. However, probabilities clearly have a role to play in idealized normative standards of evaluation, where the cost of evaluation is not relevant. For example, the standard of utility maximization may appropriately evaluate a decision even in cases where the standard's application requires high-cost, complex calculations of expected utilities.
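The matching-pennies example can be made computational. In the sketch below, with an invented figure of 80% for the psychologist's predictive accuracy, each option is evaluated using probabilities of the opponent's plays conditional on one's own choice; an option is self-ratifying only if it still maximizes expected utility on the supposition that it is chosen.

```python
options = ["Heads", "Tails"]

def u(mine, theirs):
    # The matcher gains 1 on a match and loses 1 otherwise.
    return 1 if mine == theirs else -1

def p_opponent(theirs, given_choice):
    # The psychologist predicts the agent's choice 80% of the time
    # (an invented figure) and shows the opposite face.
    opposite = "Tails" if given_choice == "Heads" else "Heads"
    return 0.8 if theirs == opposite else 0.2

def eu(act, given_choice):
    # Expected utility of `act`, with the opponent's play distributed
    # conditionally on the agent's supposed choice.
    return sum(p_opponent(t, given_choice) * u(act, t) for t in options)

for choice in options:
    ratified = all(eu(choice, choice) >= eu(alt, choice) for alt in options)
    print(choice, eu(choice, choice), "self-ratifying?", ratified)
# Neither option is self-ratifying: given either choice, the alternative
# has higher expected utility on the evidence the choice provides.
```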
3
DEGREES OF BELIEF AND DESIRE
The previous section reviews a decision problem’s standard representation and also relationships between the representation’s components, in particular, the relationship between probabilities and utilities of options’ possible outcomes and preferences among options. How should Bayesian statistics use those relationships to illuminate subjective probabilities?
3.1
Definition
A representation theorem assumes that preferences follow expected utilities and then derives probabilities and utilities from the preferences. Some theorists use a representation theorem to ground explicit definitions of subjective probabilities and utilities. They define a person’s probabilities and utilities as the values of the functions that the person’s preferences imply when the preferences satisfy the theorem’s axioms. Accordingly, preferences among options agree with the options’ expected utilities by definition. So the injunction to form preferences in accord with expected utilities has no normative force.
The previous section distinguishes a representation theorem's normative and merely structural axioms of preference. The normative axioms of preference are principles of rationality, and complying with them, given the nonnormative, structural axioms, is equivalent to having preferences that imitate agreement with expected utilities. A stronger normative principle requires not only that preferences behave as if they agree with expected utilities, but also that preferences agree with expected utilities. This stronger principle assumes that probabilities and utilities are defined independently of preferences and the expected utility principle.

If probabilities and utilities are defined in terms of preferences among options, then methods of using them to form preferences among options seem otiose, because those preferences generate the probabilities and utilities. The probabilities and utilities do not generate the preferences. However, in special cases a two-step method, proceeding as follows, generates preferences among options. First, obtain probabilities of states and utilities of outcomes from preferences among a set of options. Suppose that additional options exist involving states and outcomes relevant to the original options. Then one may perform a second step generating preferences among the additional options. One may obtain the additional options' expected utilities using probabilities and utilities already measured and use those expected utilities to form preferences among the additional options. Forming the new preferences this way ensures that the expanded set of preferences complies with the preference axioms and is consistent with the preference ordering of the original options. This method of preference formation assumes that different options may have the same outcomes and that preferences among the original options remain constant during formation of preferences among the additional options.

A person's basic preferences among features of possible worlds, such as personal health, explain the person's preferences among possible worlds and preferences among outcomes. Defining probabilities and utilities in terms of preferences among options does not yield an account of a person's basic preferences among features of possible worlds. The representation theorems work with preferences no more basic than preferences among possible worlds. The utility functions they yield do not analyze the utility of a possible world into the utilities of its features. Possible worlds are the atoms of the preference structures a representation theorem analyzes.

Moreover, satisfying a representation theorem's structural axioms generally requires more preferences than a typical person has. When confronted with gambles concerning unstudied states, many people reasonably do not form preferences among those gambles. If forced to select a gamble, a person's selection does not reveal a genuine preference. For example, a person may not have preferences concerning bets about the number of votes cast in the last presidential election. If that person is forced to select a bet about that number, she may select a bet at random, so that her selection does not reflect a preference.

Probabilities and utilities inferred from one set of a person's preferences may differ from probabilities and utilities inferred from another set of the same person's
preferences. To handle this problem, a definition of probabilities and utilities in terms of preferences may restrict itself to cases in which a person's preferences are coherent in the sense that all sets of the person's preferences implying probabilities and utilities imply compatible probabilities and utilities. However, definitions of probabilities and utilities in terms of a person's preferences typically take a different approach. They use a comprehensive set of preferences. That is, they use all the preferences that the person has. They assume that those preferences are coherent in the sense of complying with standard normative axioms of preference.

If preferences define probabilities, then updating probabilities is just updating preferences. Representation theorems need additional axioms of preference governing change of preferences over time to permit inference of probabilities from preferences holding at different times. The handiest axiom requires that preferences be updated so that it is as if they agree with expected utilities after updating probabilities by conditionalization.

As this section explains, defining probabilities in terms of preferences creates problems. An alternative to using a representation theorem to ground an explicit definition of probabilities in terms of preferences is to take probabilities as rational degrees of belief attaching to propositions. Rational degrees of belief conform to the laws of probability, as Section 4 argues. They yield a suitable interpretation of the probabilities that those laws govern. Assigning a degree of belief to a proposition is having a psychological attitude toward that proposition. The attitude is a mental state not defined in terms of preferences but inferable from them if the agent is rational, cognitively ideal, and in ideal circumstances for forming degrees of belief.4 Giving degrees of belief an interpretation conceptually independent of preferences is congenial with nonbehavioristic accounts of mental states. Of course, such an interpretation must explain the propositional attitude a degree of belief represents. The next subsection sketches an explanation.

Nothing in the Bayesian decision-theoretic approach to statistics requires that probabilities and utilities be defined in terms of preference. It suffices that probabilities and utilities be inferable from preferences. The representation theorems establish their inferability from preferences, assuming that preferences satisfy standard axioms. Because of the problems that definitions using preferences generate, the Bayesian decision-theoretic approach to statistics is on firmer ground if it takes probabilities and utilities to be expressions of attitudes toward propositions, inferable from preferences but not defined in terms of preferences.

4 These idealizations skirt objections that Maher [forthcoming] raises. Weirich [2004] shows how to weaken the idealizations without losing precision.
3.2
Inference
A degree of belief attached to a proposition is a degree of belief that the proposition is true. Degrees of belief may be implicitly defined by the theories to which they
belong, as in Weirich [2001]. Degrees of belief are quantitative representations of belief states, but do not presume that belief states are themselves quantitative. The belief states and their representations have many independent features. For example, two belief states, one resting on more extensive evidence than the second, may receive the same quantitative representation but may behave differently in response to new information. The more extensively supported belief state may change less rapidly than the less extensively supported belief state changes. A belief state is more complex than representation by a single number indicates. Furthermore, although traditionally degrees of belief use the real number system, belief states may have features that warrant alternative representations. Nonstandard analysis inspires representations of belief states that accommodate infinitesimal degrees of belief.

Taking probabilities as rational degrees of belief strengthens the norms that decision theory imposes. As mentioned, it makes preferences' agreement with expected utilities a normative requirement, not a definitional truth. For example, if betting has higher expected utility than not betting, then the normative principle says that one should prefer betting. In contrast, the corresponding definitional truth holds that, by the meaning of expected utility, one prefers betting to not betting; otherwise, it is false that betting's expected utility exceeds not betting's expected utility. Assuming that probabilities are rational degrees of belief, that one option has higher expected utility than another explains why a rational person prefers the first option to the second. Expected utilities justify preferences.

Another advantage of this interpretation of probabilities is that one may calculate expected utilities to form preferences without extracting probabilities and utilities from preferences already formed. The normative principle to follow expected utility applies to a single preference and does not require constant preferences among some options to generate probabilities of states. Preferences may change from one moment to the next, and need not be the same throughout a period of time during which they entail probabilities and utilities that generate new preferences. Normative principles of preference formation require fewer resources if probabilities are not defined in terms of preferences.

Furthermore, taking probabilities as rational degrees of belief yields a richer account of the factors that affect preferences among options. Because of reliance on representation theorems, some Bayesians constrain an option's outcome so that options may more easily share an outcome. This facilitates extracting outcomes' probabilities and utilities from preferences among options. However, constraining outcomes may exclude factors that affect rational preferences among options. For instance, a rational preference concerning an option may take account of the risk the option runs and the agent's attitude toward that risk. The risk may depend on features of the option such as the agent's distribution of degrees of belief over the option's possible outcomes. Options with a different distribution do not share the same risk. Constraining outcomes to promote shared outcomes conceals the grounds of an agent's preferences. A representation theorem using
such constrained outcomes shows only that an agent's preferences are as if they agree with expected utilities and as if the agent cares only about the factors in the constrained outcomes. In contrast, taking probabilities as rational degrees of belief permits a comprehensive account of an option's possible outcomes that includes factors such as risk. This comprehensiveness is vital for decision theory because the normative principle of expected-utility maximization is sound only if possible outcomes are comprehensive. Taking probability and utility as implicitly defined theoretical terms retains the value of representation theorems. These theorems still show that given their assumptions one may infer probabilities and utilities from the rational preferences of an ideally situated ideal agent. Probabilities and utilities still have that grounding in preferences. Other means of inferring probabilities are also possible, however. A person may use the laws of probability to infer some probabilities from others. Also, a person may use introspection to identify some probabilities. For example, he may know that he is certain of some state of the world and so assigns 1 as its probability. He may infer the probability's value without extracting a complete probability assignment from his preferences. Moreover, one may infer probabilities from their causes as well as from their effects. For example, one may infer some probabilities from an agent's evidence. Suppose that an agent knows the objective probability of an event. If he is rational, one may infer that the probability he assigns to the event has the same value as its objective probability. This inference invokes a version of Lewis's [1986: 87] Principal Principle, which moves from knowledge of a proposition's objective probability to a corresponding subjective probability assignment to the proposition. One may infer a person's degrees of belief from a small set of her preferences. Suppose that a person is willing to buy or sell for $0.40 a bet that pays $1 if the state S holds and $0 if it does not. These betting odds yield a betting quotient of 40%. Given ideal conditions, one may infer that the person's degree of belief that S holds equals 40%. Only that value is compatible with expected-utility maximization using degrees of belief. In general, betting quotients equal degrees of belief. Some theorists take the equality of degrees of belief and betting quotients as a definition of degrees of belief. The definition grounds degrees of belief as solidly as does any representation theorem. Also, the definition yields degrees of belief in cases where an expected-utility representation of all preferences does not exist. This extension of degrees of belief is attractive because it extends normative decision principles to more cases. It applies those principles not only when an expected-utility representation of preferences exists but also in other circumstances. Despite the advantages of defining degrees of belief in terms of betting quotients, a theory of rationality does better, all things considered, by taking degrees of belief as implicitly defined theoretical entities. Then it may advance the normative principle that betting quotients should equal degrees of belief. A broad psychological account of degrees of belief enhances a theory of rationality's normative power.
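The inference from betting quotients can be made concrete. In this minimal Python sketch (the numbers are illustrative, and linear utility in money is assumed), an expected-utility maximizer gains by buying the $1 bet at prices below her degree of belief, gains by selling at prices above it, and is indifferent exactly at the betting quotient:

    # Expected utility of buying or selling, for $x, a bet that pays $1 if the
    # state S holds and $0 otherwise, for an agent with degree of belief p in S.
    def eu_buy(p, x):
        return p * (1 - x) + (1 - p) * (0 - x)   # simplifies to p - x

    def eu_sell(p, x):
        return p * (x - 1) + (1 - p) * x         # simplifies to x - p

    p = 0.40
    for price in (0.30, 0.40, 0.50):
        print(price, round(eu_buy(p, price), 2), round(eu_sell(p, price), 2))
    # Buying gains when price < p and selling gains when price > p; at price = p
    # the agent is indifferent, so willingness to buy or sell at $0.40 reveals a
    # degree of belief of 0.40.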
4 PROBABILITY AXIOMS
For degrees of belief to be probabilities, they have to conform to the probability axioms and hence to all the laws of probability. Taking probabilities as rational degrees of belief requires arguing that if degrees of belief are rational, they conform to the axioms, at least in the case of an ideal agent without cognitive limits and in ideal conditions for forming degrees of belief.
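The axioms at issue are non-negativity, normalization (logical truths receive probability 1), and additivity for incompatible propositions. A minimal Python check on a toy credence assignment (the numbers are illustrative) shows what conformity amounts to:

    # Check a credence assignment over the four state-descriptions of two
    # propositions A and B against the finitely additive probability axioms.
    atoms = {("A", "B"): 0.2, ("A", "~B"): 0.3, ("~A", "B"): 0.4, ("~A", "~B"): 0.1}

    non_negative = all(p >= 0 for p in atoms.values())    # axiom 1
    normalized = abs(sum(atoms.values()) - 1.0) < 1e-9    # axiom 2: P(tautology) = 1

    # Axiom 3 (additivity) yields inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A & B).
    p_a = atoms[("A", "B")] + atoms[("A", "~B")]
    p_b = atoms[("A", "B")] + atoms[("~A", "B")]
    p_a_or_b = 1.0 - atoms[("~A", "~B")]
    additive = abs(p_a + p_b - atoms[("A", "B")] - p_a_or_b) < 1e-9

    print(non_negative, normalized, additive)             # True True True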
4.1 The Dutch Book Argument
A common way of arguing for conformity to the probability axioms examines the consequences of using degrees of belief to guide action if they fail to conform to the axioms. De Finetti [1937/1964] pursues this line of argument. He takes an agent's degrees of belief to govern the odds that the agent posts for buying and selling gambles. Consequently, if degrees of belief do not comply with the probability axioms, then the agent's betting quotients license a Dutch book. This is a system of bets that guarantees a net loss. The case for compliance with the probability axioms is called the Dutch book argument. Kemeny [1955] shows that compliance with the probability axioms is necessary and sufficient for preventing Dutch books. In the Dutch book argument, one may assume that the bookie has the same information that the bettor has and that no gain elsewhere compensates for a system of bets that guarantees a loss. Bets may be on ethically neutral states that do not affect enjoyment of gains and on states eventually within the bettor's and the bookie's ken so that bets are settled. Making the argument airtight requires many background assumptions, as Kyburg [1983: 81–85] notes. Skyrms [1990: Chap. 5] views the argument as only a loose dramatization of the incoherence of having degrees of belief not conforming to the probability axioms. Christensen [1996] presents an account of that incoherence. A number of philosophers criticize Dutch book arguments. Kennedy and Chihara [1979], for example, argue that Dutch book arguments fail to establish the intended conclusion that conformance of belief strengths to the probability axioms is necessary and sufficient for rational betting. However, this section presents a different objection, one that grants that a careful version of the Dutch book argument establishes its conclusion about conformity to the laws of probability. The objection maintains that nonetheless the argument does not justify the probability laws for degrees of belief. Because the laws are epistemic principles for degrees of belief, a justification of the laws requires epistemic reasons for complying with the laws. Dutch book arguments offer pragmatic reasons for compliance with the laws of probability. A proper epistemic justification gives a person reasons to comply with the laws even if she does not gamble and does not post odds for bets. Rosenkrantz [1981], for instance, distinguishes cognitive and pragmatic goals. As he argues, "We need . . . an argument to show why incoherent beliefs are irrational from the perspective of the agent's purely cognitive goals" (p. 2.1-4). Dutch book arguments supply no such epistemic justification for believing probabilistically.
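A small Python sketch (with illustrative numbers, not de Finetti's own example) shows how betting quotients that violate additivity license such a book:

    # An agent posts betting quotients q(S) = 0.3 and q(~S) = 0.5, violating
    # q(S) + q(~S) = 1. At those quotients she sells, for $q, any bet that pays
    # $1 if the named state holds. The bookie buys both bets.
    quotients = {"S": 0.3, "not-S": 0.5}
    stake = 1.0

    receipts = sum(quotients.values()) * stake   # $0.80 collected from the bookie
    payout = stake                               # exactly one state holds: she pays $1.00

    print(round(receipts - payout, 2))           # -0.2: a guaranteed loss of $0.20,
                                                 # whether or not S holds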
Suppose that a representation theorem grounds a method of inferring probabilities and utilities. Then it also furnishes an argument for compliance with the probability laws. It shows that, given a rich preference structure, having degrees of belief that violate a probability law entails violating a normative preference axiom if degrees of belief yield options’ expected utilities. However, because this argument depends on axioms of preference, it is not purely epistemic. Probability has a role in computing expected utilities that guide action. So it has a pragmatic role as well as an epistemic role. Why should one want a purely epistemic argument that rational degrees of belief comply with the laws of probability? One may infer that rational degrees of belief comply with probabilistic laws given their role in action-guiding principles. However, degrees of belief epistemically assess propositions for truth. A pragmatic argument does not explain why it is epistemically appropriate that the laws of probability govern degrees of belief. So a successful epistemic argument offers more than a pragmatic argument does. It offers an explanation of the epistemic force of the probability laws for degrees of belief. It does this by revealing epistemic reasons for having degrees of belief that follow the probability laws. Besides establishing that degrees of belief conform to the laws of probability, it epistemically explains their conformity to those laws.
4.2 Calibration
An epistemic argument for the probability laws may appeal to epistemic goals. One such epistemic goal for degrees of belief is called calibration. An agent's degrees of belief are calibrated if, for each proposition, the agent's degree of belief that the proposition is true equals the proposition's objective probability. Because objective probabilities obey the laws of probability, an agent's degrees of belief equal the corresponding objective probabilities only if those degrees of belief also obey the probability laws. Thus, the epistemic goal of calibration furnishes an appropriate epistemic reason for having degrees of belief that conform to the probability laws. Calibration is an epistemic goal for degrees of belief in much the same way that the epistemic goal of believing all and only true propositions is an epistemic goal. A person falls short of the goal for belief when she believes a false proposition or fails to believe a true proposition. Calibration is a suitable epistemic goal given that objective probabilities, not truth-values, are physically accessible. A person falls short of this goal for degrees of belief if her degree of belief that a proposition is true is more or less than the proposition's objective probability. A person also fails to reach the goal of calibration with respect to a proposition if she fails to form a degree of belief that the proposition is true. Reaching the goal of calibration has epistemic value, and the further a person departs from the goal, the less epistemic value her degrees of belief have.
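One familiar measure of how far degrees of belief fall from this goal is the Brier score, which figures in Rosenkrantz's calibration argument below. The following Python sketch is illustrative only and uses realized truth-values (1 for true, 0 for false) as a proxy for the objective probabilities the agent cannot directly access:

    # Brier score: mean squared distance between degrees of belief and outcomes.
    # Lower is better; 0 would indicate perfectly calibrated, fully informed credences.
    def brier(credences, outcomes):
        return sum((p - o) ** 2 for p, o in zip(credences, outcomes)) / len(credences)

    outcomes = [1, 0, 1, 1, 0]
    print(round(brier([0.9, 0.2, 0.8, 0.7, 0.1], outcomes), 3))  # 0.038: close to the truth
    print(round(brier([0.5, 0.5, 0.5, 0.5, 0.5], outcomes), 3))  # 0.25: uninformative
    print(round(brier([0.1, 0.9, 0.2, 0.3, 0.8], outcomes), 3))  # 0.678: badly off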
Shimony [1988: 80] advances a similar epistemic argument for compliance with the probability laws. He says, "The crux of [his proposal] is the rough idea that epistemic probability is somehow an estimate of relative frequency." Epistemic probability is another term for rational degree of belief, and relative frequency is
a kind of objective probability. Because people are not in a position to achieve perfect calibration, a rational degree of belief is an estimate of an objective probability. A revised calibration argument shows that the probability laws govern rational estimates of objective probabilities, too. Rosenkrantz [1981] favors a calibration argument that uses Brier scores as a means of calculating how well calibrated degrees of belief are. Van Fraassen [1983] and Joyce [1998] elaborate calibration arguments, too. For present purposes, let us grant the epistemic goal of having degrees of belief that match objective probabilities. Then an important elaboration of this idea is an account of the best means of pursuing the goal of calibration given our ignorance of objective probabilities. Let us also suppose for present purposes that the pursuit of calibration is best accomplished by making estimates that obey the probability laws. Thus, successful degrees of belief accord with the laws of probability. Granting all these background assumptions, a problem nonetheless looms for this sort of justification of conforming belief strengths to the laws of probability. Forming degrees of belief with an eye on the epistemic goal of calibration runs a risk of forming degrees of belief that do not guide action well. To introduce the problem, I shall first show how belief formation guided by epistemic goals creates a similar risk. Epistemologists put aside pragmatic reasons for belief, although they acknowledge that pragmatic considerations may justifiably influence belief because of belief's role in communication and other areas besides epistemic inquiry. Their discipline focuses on epistemic reasons for belief and epistemic justification of belief. Some accounts of epistemic justification of belief draw on accounts of rational pursuit of epistemic goals such as holding true beliefs. Levi [1967], for instance, proposes forming beliefs that maximize epistemic utility. Maher [1993: Chaps. 6–8] similarly advocates accepting a hypothesis according to its epistemic utility. A belief's epistemic utility depends on the epistemic gain if it is true and the epistemic loss if it is false. Factors such as informational value affect epistemic gain. However, forming beliefs because of their informational value makes beliefs poor guides to action. Suppose that a hypothesis H has high informational value so that believing it maximizes epistemic utility despite the risk of error. An agent forms the belief that H and then uses the belief to direct action. Suppose that a possible outcome of a particular act ranks high all things considered, including attainment of epistemic goals such as informational value. That outcome obtains if H is true, and, because the agent believes that H, she performs that act. Informational value's desirability then influences the attractiveness of the act twice, once in making belief that H attractive and a second time in making the act's expected outcome attractive. The result is double counting informational value and overvaluing the act. Levi's [2006] method of belief and degree of belief formation is susceptible to this problem. He treats beliefs as certainties and uses beliefs as evidence grounding degrees of belief that direct action. Epistemic goals influence belief formation,
formation of degrees of belief, and then action. According to Levi, whether an agent's inquiry should culminate in accepting a hypothesis H depends on the expected epistemic value for the agent of believing that H. His account of the expected epistemic value of a belief uses credal probabilities. A credal probability is a type of subjective probability that is relative to evidence. Suppose that QK is a probability function that for an agent assigns credal probabilities to propositions according to her current corpus of knowledge K. Let α be the agent's degree of caution in belief formation, and suppose that a value assignment Vα assigns for the agent an epistemic value (with respect to α) of believing a hypothesis given a possible truth-value for it. Then the following formula expresses the agent's expected epistemic value with respect to α of believing that H.

EVα(H) = QK(H)Vα(H, true) + QK(∼H)Vα(H, false)

Rival hypotheses have expected epistemic values also. An agent should accept H only if the expected epistemic value of believing H is at least as great as the expected epistemic value of believing any rival hypothesis. If probabilities guide action but rest on beliefs adjusted to promote informational value, then epistemic goals are double counted. They count once in adjustments of beliefs and a second time in decisions about action. Action thus accords excessive weight to an agent's epistemic goals. An agent who follows Levi's principle of belief formation may be exploited as a result. Suppose, for example, that she boosts from 0.8 to 1.0 the probability of a hypothesis because accepting the hypothesis maximizes expected epistemic value in light of the hypothesis's high informational value. Before accepting the hypothesis, she may sell for $0.80 a ticket that pays a dollar if the hypothesis is true and otherwise nothing. After accepting the hypothesis (because acceptance amounts to certainty), she may buy back the ticket for $1.00. She loses $0.20 from these transactions. The selling and buying back indicate a preference reversal that occurs without any change in basic goals or evidence. The combination of transactions is irrational. The only epistemic justification of belief is fit with evidence. Even the epistemic goal of true belief does not justify belief because the goal may lead to a belief that does not fit with evidence. It may prompt excessive epistemic risk-taking. A belief that aims for informational value and not just fit with evidence lacks epistemic justification despite having purely epistemic goals. Some epistemic goals for beliefs may make beliefs unsuitable guides to action. Action-guiding beliefs should maintain fit with evidence. The kinds of problems that affect formation of beliefs according to epistemic goals also affect formation of degrees of belief to achieve the goal of calibration. Suppose that an agent initially assigns a degree of belief 0.5 to the state S and then discovers, without acquiring additional empirical evidence, that the best prospect of calibration obtains if she boosts to 0.6 her degree of belief that S holds. Then she may sell a dollar gamble that S for $0.50 and afterwards buy it back for $0.60, losing $0.10 without any change in empirical evidence concerning S.
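A toy computation (the value assignment and the numbers are hypothetical, not Levi's) shows how the formula and a caution parameter interact:

    # Toy version of Levi's formula: EV_alpha(H) = Q(H) V(H, true) + Q(~H) V(H, false),
    # with a hypothetical value assignment: a true belief is worth 1, a false
    # belief costs 1/alpha, where alpha (0 < alpha <= 1) indexes boldness.
    def expected_epistemic_value(q_h, alpha):
        v_true, v_false = 1.0, -1.0 / alpha
        return q_h * v_true + (1 - q_h) * v_false

    for alpha in (0.25, 0.5, 1.0):
        print(alpha, round(expected_epistemic_value(0.8, alpha), 2))
    # 0.25 -> 0.0, 0.5 -> 0.4, 1.0 -> 0.6: at credal probability 0.8, a cautious
    # agent is merely indifferent about accepting H, while bolder agents accept
    # it (taking suspension of judgment to have expected epistemic value 0).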
The influence of the goal of calibration leaves the agent open to exploitation. Letting calibration guide degrees of belief tends towards a form of wishful thinking. Aiming for calibration, an agent may adjust a degree of belief to enhance the prospect of calibration and not just to respond to the strength of evidence. Degrees of belief formed to match objective probabilities may fail to match strengths of evidence. An ideal agent in ideal circumstances should not let anything other than evidence direct probability assignments that control action. Only the goal of representing strength of evidence suits probability's role in guiding action. Letting epistemic goals influence formation of beliefs and degrees of belief makes belief states the result of pursuit of those goals. An agent risks failure to respond well to strength of evidence if she aims for calibration. Beliefs and degrees of belief guide action best if they just represent evidence's strength (as the agent evaluates it if there is room for judgment). Belief states are not in an agent's direct control, and that makes them suited for guiding decisions that aim for pragmatic goals. Decision principles rely on assessments of strength of evidence, and degrees of belief provide those assessments. If degrees of belief are to guide action well, they should respond only to strength of evidence and not also to calibration. Degrees of belief pursuing calibration may misrepresent strength of evidence, and so fail to guide action well. Suspension of judgment is often reasonable when evidence is scarce. Similarly, not forming a degree of belief is often reasonable when evidence is scarce. The goal of calibration encourages forming a degree of belief when one ought to suspend judgment because evidence does not warrant precision. It encourages spurious precision of the sort Kaplan [1996, Chap. 1] deplores. Just as an agent may rashly form a belief for the sake of informational value, an agent may rashly form a degree of belief for the sake of calibration. It is better to suspend judgment and forgo prospects of calibration if evidence does not support the degree of belief, that is, if the degree of belief does not accord with strength of evidence. Calibration is a purely epistemic goal and offers a purely epistemic argument for the probability laws. Still, it does not justify the laws. It does not explain their truth. Calibration is not the proper epistemic goal for a justification of the probability laws. Rather, the proper goal is having degrees of belief that accord with evidence. I assume without argument that the epistemic goal for degrees of belief is matching strength of evidence and only say a few words to clarify this goal. Its consequences are straightforward. For example, if one's evidence for a hypothesis is strong, then ideally one has a high degree of belief that the hypothesis is true. Rational degree of belief may diverge from strength of evidence in some technical senses. However, in the sense of strength of evidence that measures confirmation with respect to total evidence, for an ideal agent in ideal circumstances rational degree of belief matches strength of evidence. The goal of matching strength of evidence is a subsidiary epistemic goal. It serves the primary goal of having degree of belief 0 in falsehoods and degree of belief 1 in truths. Given our normal condition of incomplete information, the best means of pursuing the primary goal is pursuing the secondary goal.
Perhaps the correct way to pursue the goal of calibration, given ignorance of objective probabilities, is to form degrees of belief so that they represent strength of evidence. An agent achieves this subsidiary goal when, for each proposition for which the agent has a degree of belief, the agent's degree of belief that the proposition is true matches the agent's strength of evidence for the proposition. If the goal of calibration leads to degrees of belief representing strength of evidence, then pursuing it makes degrees of belief suited to their role in the evaluation of action. Degrees of belief responding to and only to that epistemic goal will not go wrong. However, even if the goal of calibration requires that degrees of belief match strength of evidence, it does not justify the laws of probability. The complaint against Dutch book arguments applies here as well. One may only infer that degrees of belief conform to the probability laws because they pursue well the goal of calibration. That inference is not an explanation of the rationality of their compliance with the laws. The goal of calibration is epistemic, but the argument from calibration fails for the same reason that a pragmatic argument fails. An explanation of the probability laws must show why degrees of belief representing strength of evidence satisfy those laws. An epistemic justification of the laws must show that the goal of representing strength of evidence leads to degrees of belief that comply with those laws. Not every epistemic argument serves this purpose. Because the epistemic job of degrees of belief is representation of strength of evidence, an explanatory argument must show that doing this job requires conformity with the probability laws. To highlight the problem, suppose that rational pursuit of calibration requires degrees of belief that minimize expected deviation from objective probabilities. The measure of deviation may be designed so that aiming to minimize it requires using degrees of belief that fit evidence. Then perhaps it follows that if degrees of belief are to minimize expected deviation, they must satisfy the probability laws. However, such an argument still does not justify their compliance with the probability laws. Their compliance is only an artifact of the measure of deviation, so the measure of deviation needs justification for the argument to work.
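The shape of such a minimization argument can be exhibited numerically. In this Python sketch (using a hypothetical two-hypothesis distribution over chances), the credence that minimizes expected squared deviation is the expectation of the objective chance, and the minimizer satisfies the probability laws automatically, which is just the artifact at issue:

    # Expected squared deviation of a credence q from an unknown objective chance c,
    # under a hypothetical distribution over chance hypotheses.
    chances = {0.2: 0.5, 0.8: 0.5}   # two chance hypotheses, equally probable

    def expected_deviation(q):
        return sum(w * (q - c) ** 2 for c, w in chances.items())

    for q in (0.2, 0.5, 0.8):
        print(q, round(expected_deviation(q), 2))
    # 0.5 (the expected chance) minimizes: 0.09, versus 0.18 at either extreme.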
Belief and degree of belief are passive states. They should respond to evidence and not to informational value, calibration, or any goal whether pragmatic or epistemic. An agent should only indirectly control formation of degrees of belief by taking reasonable steps to make the formation process produce degrees of belief that represent strength of evidence. A justification of the probability laws for degrees of belief, an explanatory argument, must show that degrees of belief do not represent strength of evidence unless they conform to the laws of probability. Proper control may produce justified degrees of belief. One may control degrees of belief so as to make them reflect strength of evidence. Then they are perfect for their role in guiding action. However, controlling degrees of belief to make them represent strength of evidence may have greater costs than benefits, just as removing inconsistencies in beliefs may have greater costs than benefits. Whether an agent should take steps to ensure that her system of degrees of belief has some desirable property depends on the expected costs and benefits of taking such steps.
Similarly, pursuit of calibration may not be a worthwhile exercise of belief-control all things considered. Perhaps it will shed light on the issue to look at belief-control from the perspective of artificial intelligence. How should one design the formation of degrees of belief for a utility-maximizing robot with multiple goals? A good design incorporates a mechanism for forming degrees of belief according to evidence. It does not let the robot control the formation process. If the robot controls that process, it may well use its control to promote goals besides having degrees of belief that represent strength of evidence. A utility maximizer exercises any control it has to improve its prospects of meeting all its goals. It never acts for the sake of one goal without regard for its other goals. Hence if the robot's mechanism for forming degrees of belief is to have only the goal of representing evidence, then the robot should not control the belief-forming mechanism. Similarly, people have better prospects for degrees of belief that match strength of evidence because they do not directly control formation of degrees of belief.
5 CONDITIONALIZATION
Bayesianism advances the principle of conditionalization for updating probabilities as evidence changes. The principle uses standard conditional probabilities to update nonconditional probabilities when new evidence arrives. This section presents conditionalization and then examines some common misunderstandings of, and objections to, conditionalization.
5.1 Conditionalization's Features
According to the principle of conditionalization, when one acquires new evidence E (and no other evidence), the probability of a hypothesis H given the new evidence E, PE(H), equals the former probability of the hypothesis conditional on the evidence. That is, PE(H) = P(H|E) = P(H&E)/P(E). The equality PE(H) = P(H|E) is a principle of inductive inference. It is not a definition because the first probability exists only after acquisition of the evidence E, whereas the second probability exists prior to acquisition of that evidence. Applications of conditionalization distinguish prior and posterior probabilities. P(H) and P(E) are called prior probabilities. They are probabilities assigned prior to acquisition of new data and are updated with the acquisition of new data. A prior probability such as P(E) may but need not be calculated using a formula such as P(E) = P(E|H)P(H) + P(E|∼H)P(∼H). It represents a rational agent's degree of belief and may be inferred from the agent's betting behavior. PE(H) is the probability of H after acquiring exactly evidence E. It is a posterior probability as opposed to a prior probability.5

5 The probability of two events together is called a joint probability, and the probability of one event alone is called a marginal probability. Thus, sometimes P(H&E) is called a joint probability, and P(H) and P(E) are called marginal probabilities. However, in the context of conditionalization, P(H) and P(E) are called prior probabilities.
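In code, the update is a short computation over a joint credence assignment. This minimal Python sketch (with hypothetical numbers) computes P(E) by the total-probability formula above and then conditionalizes:

    # Conditionalization on evidence E: the posterior P_E(H) equals the prior
    # conditional probability P(H|E) = P(H & E) / P(E).
    prior = {("H", "E"): 0.32, ("H", "~E"): 0.08,    # hypothetical joint priors
             ("~H", "E"): 0.18, ("~H", "~E"): 0.42}

    p_e = prior[("H", "E")] + prior[("~H", "E")]     # total probability: P(E) = 0.5
    p_h = prior[("H", "E")] + prior[("H", "~E")]     # prior P(H) = 0.4
    posterior_h = prior[("H", "E")] / p_e            # P(H|E) = 0.64

    print(p_h, posterior_h)   # learning exactly E raises P(H) from 0.40 to 0.64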
To illustrate, suppose that an agent becomes certain that a piece of blue litmus paper turned red after immersion in a liquid. Accordingly, the agent's probability assignment to that proposition equals 1. The proposition's truth is a newly acquired bit of evidence. The principle of conditionalization claims that, after acquiring just that evidence, the probability of the hypothesis that the liquid is acid equals the probability of the hypothesis given that evidence. Standard Bayesian conditionalization assigns newly learned evidence E a probability of 1 because it understands acquisition of evidence that E as coming to be certain that E. However, Bayesian conditionalization may be generalized to cases where a person's observational experience leaves her less than certain that an evidential proposition E is true. Jeffrey [1965/1983] shows how to generalize conditionalization for cases in which an experience changes a person's probability assignment but does not increase any evidential proposition's probability all the way to 1. For example, an observation of color in bad light may influence a person's probability assignment that way. According to Jeffrey's generalization of conditionalization, when an experience changes a person's probability for an evidential proposition E from a previous value Pold(E) to a new probability value Pnew(E), the new conditional probability of a hypothesis H given the truth of E will generally not change – that is, Pnew(H|E) = Pold(H|E); and similarly for the negation of E, Pnew(H|∼E) = Pold(H|∼E). Then, by a theorem of probability theory, the new probability of H based on the experience itself, Pnew(H), equals Pold(H|E)Pnew(E) + Pold(H|∼E)Pnew(∼E). Thus, the Bayesian approach to updating on evidence need not be limited to cases in which a person becomes certain of an evidential proposition.
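A minimal Python sketch of Jeffrey's rule (with illustrative numbers) makes the generalization concrete; setting the new probability of E to 1 recovers standard conditionalization:

    # Jeffrey conditionalization:
    # P_new(H) = P_old(H|E) P_new(E) + P_old(H|~E) P_new(~E),
    # assuming the experience leaves the conditional probabilities unchanged.
    p_h_given_e, p_h_given_not_e = 0.9, 0.2   # rigid conditionals (hypothetical)
    p_e_old, p_e_new = 0.3, 0.7               # the experience shifts P(E), short of certainty

    p_h_old = p_h_given_e * p_e_old + p_h_given_not_e * (1 - p_e_old)
    p_h_new = p_h_given_e * p_e_new + p_h_given_not_e * (1 - p_e_new)
    print(round(p_h_old, 2), round(p_h_new, 2))   # 0.41 -> 0.69

    # With p_e_new = 1.0 the formula reduces to P_new(H) = P_old(H|E).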
The principle of conditionalization is handicapped in case the new evidence has a prior probability equal to zero. Then the probability of a hypothesis conditional on it is undefined. Supplementary principles may handle that special case. They may authorize a fresh start for probability assignments so that the prior probability of the evidence is discarded and a new, positive assignment takes its place. Also, conditional probabilities with respect to conditions of probability zero may be defined in terms of hypothetical conditionals. The second approach, the appeal to hypothetical probability assignments, may be warranted, in addition, as a method of addressing the problem of old evidence. That is the problem of explaining how, given the principle of conditionalization, evidence may confirm a hypothesis when the prior probability of the evidence equals 1 because the evidence was known in advance of the hypothesis's formulation. The evidence does not boost the probability of the hypothesis according to the formula PE(H) = P(H|E) = P(H&E)/P(E) because if P(E) equals 1, then PE(H) = P(H). Nonetheless, evidence on hand may support a hypothesis. For example, facts about the planet Mercury's orbit support the general theory of relativity although those facts were known before Einstein formulated his theory. Conditionalization may handle this problem by comparing the hypothetical probability that a hypothesis has independently of a bit of evidence to the probability that
the hypothesis has given that evidence. A representation of a dynamic system of preferences that uses probabilities may make conditionalization a definitional truth. However, taking degrees of belief as implicitly defined theoretical entities makes conditionalization a norm for degrees of belief. Dynamic versions of the Dutch book argument, presented for example by Teller [1973] and Skyrms [1990], offer pragmatic reasons for complying with the principle of conditionalization. An epistemic justification of the principle, that is, a justification that grounds the principle in degree of belief’s role of representing strength of evidence, may be more desirable for the reasons Section 4 reviews. One attempt to justify conditionalization shows that conditionalization is very likely to lead eventually to high belief strengths for true hypotheses and low belief strengths for false hypotheses. However, this approach does not treat cases in which epistemic concerns focus on the present. In that case conclusions about the future carry no weight. A more promising general approach argues that strength of evidence obeys the principle of conditionalization and that rational degrees of belief match strength of evidence given standard idealizations.
5.2 Objections to Conditionalization
The literature presents many problem cases for the principle of conditionalization. Bayesians claim that ideal agents in ideal circumstances comply with the principle. Ideal agents are aware of all logical truths and do not suffer memory losses or other cognitive deficiencies. Some problem cases, such as those Arntzenius [2003] presents, consider whether conditionalization applies when idealizations are relaxed. Even when all idealizations are satisfied, conditionalization still faces problem cases. In such cases, often the best defense notes that the principle requires conditionalization with respect to exactly the new evidence and not some approximation to it. I elaborate this point with a brief look at Nicod's principle of confirmation. Some probability theorists, such as Howson and Urbach [1989: 90], advance counterexamples to Jean Nicod's principle of confirmation. That principle holds that discovering a true instance of a generalization confirms the generalization. A typical counterexample goes as follows. Suppose that a person believes that unicorns do not exist. Keeping in mind that a truth-functional conditional with a false antecedent is true, he believes the generalization that all unicorns are white, taken as (∀x)(Ux → Wx), where the arrow expresses the familiar truth-functional connective for formation of a conditional. Then the person discovers a white unicorn, Alexander. His degree of belief that the generalization is true declines. He rejects the generalization because he believes that unicorns, given that they exist, come in various colors, as horses do, and are not exclusively white. In this case a positive instance of a generalization lowers the generalization's probability. The example not only challenges Nicod's principle but also challenges the principle of conditionalization. According to conditionalization, the probability of a hypothesis H after acquiring evidence that E equals the probability of the hypothesis conditional on the evidence.
Let H be the generalization (∀x)(Ux → Wx), and let E be the positive instance (Ua → Wa) formed by letting a, for Alexander, replace x. By Bayes's Theorem, P(H|E) = P(E|H)P(H)/P(E). Suppose that P(H) and P(E) are nonzero so that probabilities conditional on H and on E are defined. Because H entails E, P(E|H) = 1. Hence P(H|E) = P(H)/P(E). P(H)/P(E) > P(H) because P(E) < 1 prior to the discovery that E is true. Therefore P(H|E) > P(H). However, in the example, the probability of the generalization after acquiring the evidence is lower than the prior probability of the hypothesis, namely, P(H). So, apparently, updating does not replace P(H) with P(H|E). It does not proceed by conditionalization. Nonetheless, it is rational. Interpreting confirmation as an increase in probability, as explained and refined in Horwich [1982: Chap. 3], conditionalization entails Nicod's principle. A generalization entails each instance. Hence if H is a generalization and E is any instance of it, P(H|E) > P(H), provided that P(H) ≠ 0 and 0 < P(E) < 1. Recall that PE(H) is the probability of H after learning that E, which according to conditionalization equals the prior probability of H under the assumption that E, that is, P(H|E). Given that the inequality PE(H) > P(H) entails E's confirmation of H, conditionalization entails that a positive instance confirms a generalization. Any counterexample to Nicod's principle is also a counterexample to conditionalization. Paying attention to the details of conditionalization and Nicod's principle sorts out the trouble. Conditionalization requires updating according to a probability conditional on the total evidence newly acquired. In the example, the instance (Ua → Wa) is not the total new evidence. Observation of the white unicorn Alexander yields evidence that (Ua & Wa). This conjunction is stronger than the instance. Given that E is (Ua & Wa), P(E|H) does not equal 1 because H does not entail E. Hence P(H|E) > P(H) does not follow. The counterexample to conditionalization dissolves. The counterexample to Nicod's principle also dissolves given that the principle treats only cases in which a positive instance of a generalization is the total evidence newly acquired. Under a charitable interpretation, the principle has this implicit restriction because, obviously, if one were to acquire knowledge of a generalization's positive instance along with other evidence, such as a raft of negative instances, the generalization need not be confirmed. Some objections to conditionalization trade on a common heuristic for calculating P(H|E). To obtain its value, one imagines the probability of H if one were to learn that E. By definition, P(H|E) is the ratio P(H&E)/P(E). This ratio equals the probability that H under the assumption that E. It need not equal the probability of H if E were learned. Learning that E often carries the extra information that one learned that E. In most cases that extra information is irrelevant, and P(H|E) equals the probability that H given that one learns that E. However, in some special cases the equality fails. In those cases a defense of conditionalization notes the distinction between conditionalizing on E and conditionalizing on the information one would have if one were to learn that E.
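The diagnosis can be checked numerically. In this Python sketch (the joint credences are hypothetical but respect H's entailments), conditionalizing on the bare instance Ua → Wa raises P(H), while conditionalizing on the total evidence Ua & Wa lowers it:

    # Joint degrees of belief over (H, Ua, Wa), with zero weight on worlds where
    # H holds but Ua -> Wa fails, and low weight on H together with an actual unicorn.
    atoms = {  # (H, Ua, Wa): probability
        (1, 0, 0): 0.30, (1, 0, 1): 0.18, (1, 1, 1): 0.02,
        (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 1): 0.05, (0, 1, 0): 0.15,
    }

    def prob(event):
        return sum(p for world, p in atoms.items() if event(*world))

    p_h = prob(lambda h, u, w: h)
    # E1: the mere instance Ua -> Wa, which H entails, so it confirms H.
    p_h_e1 = prob(lambda h, u, w: h and (not u or w)) / prob(lambda h, u, w: not u or w)
    # E2: the total evidence Ua & Wa, which H does not entail; here it disconfirms H.
    p_h_e2 = prob(lambda h, u, w: h and u and w) / prob(lambda h, u, w: u and w)

    print(round(p_h, 3), round(p_h_e1, 3), round(p_h_e2, 3))   # 0.5, 0.588, 0.286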
Bertrand's box paradox presents another challenge to conditionalization. Imagine three boxes with two drawers apiece. Each drawer of the first box contains a gold medal. Each drawer of the second box contains a silver medal. One drawer of the third contains a gold medal, and the other drawer a silver medal. At random, a box is selected and one of its drawers is opened. If a gold medal appears, what is the probability that the third box was selected? The probability seems to be 1/2. By Bayes's theorem and conditionalization, the probability is 1/3. The inclination to assign probability 1/2 to selection of the third box may generate a temptation to reject conditionalization. However, one may explain away the inclination by attending to the process that yields a gold medal. Two selection processes come to mind. One may see obtaining a gold medal as selection of a box or as selection of a drawer. The latter correctly describes the effect of the random process that selects a box and then a drawer. Given that process, a gold medal is less probable from the third box than from the first, so the third box is less probable than the first. In many cases that seem to throw doubt on conditionalization, multiple ways of understanding an underlying selection mechanism fit the case. The various mechanisms yield different conditional probabilities to use when updating. Fully specifying the underlying selection mechanism resolves the challenge to conditionalization.
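Enumerating the six equiprobable (box, drawer) pairs verifies the answer; a minimal Python sketch:

    # Bertrand's box paradox by enumeration: a box is chosen at random, then a
    # drawer at random; we condition on the opened drawer showing gold.
    boxes = {"GG": ("gold", "gold"), "SS": ("silver", "silver"), "GS": ("gold", "silver")}

    outcomes = [(name, medal) for name, drawers in boxes.items() for medal in drawers]
    gold_openings = [name for name, medal in outcomes if medal == "gold"]

    print(gold_openings.count("GS") / len(gold_openings))
    # 1/3, not 1/2: two of the three equiprobable gold drawers belong to the
    # double-gold box, so the mixed box is the less probable source.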
6 BAYESIAN AND CLASSICAL STATISTICS COMPARED
The previous sections present Bayesian statistics and its components. This section briefly compares Bayesian and classical statistics.
6.1 Subjectivity
Proponents of classical statistics often fault Bayesian methods for their reliance on subjective probabilities, in particular, subjective probabilities of a hypothesis and evidence prior to statistical tests of the hypothesis. Critics generally do not object to the kind of subjectivity that arises from investigators’ having different evidence. Although that relativity to evidence makes probability subjective, it is compatible with personal evidence’s objectively settling the hypothesis’s probability for a person. Critics object to another kind of relativity. They decline to leave latitude for the exercise of epistemic taste in assigning subjective prior probabilities. According taste a role in that assignment makes subjective probabilities dependent on personal psychology, not just on evidence. The usual defense of Bayesianism points out that some reliance on personal judgment is unavoidable in statistical inference. Classical statistics, although it uses only objective probabilities, also employs some procedures that depend on personal judgment, as the next subsection explains.
6.2 Statistical Tests
A classical statistical test of a hypothesis assigns a probability neither to the hypothesis tested nor to the test's actual outcome. It assigns probabilities only to possible outcomes of the test given the hypothesis. A hypothesis is testable only if its supposition yields well-defined objective probabilities for a test's possible outcomes. A test identifies a critical region formed by a set of possible outcomes. The test selects a critical region so that obtaining an outcome in the critical region has low objective probability given the hypothesis. According to the hypothesis, it is unlikely that the test's outcome lies in the critical region. If the actual outcome falls into the critical region, then the test rejects the hypothesis, so the critical region may also be called the rejection region. For example, take the hypothesis that a coin has two Heads. When tossing the coin, the outcome Tails forms a critical region. A toss that yields Tails produces an outcome in that critical region. Obtaining that outcome thus leads to rejection of the hypothesis. A typical significance test of the sort R. A. Fisher pioneered may investigate, using a critical region for possible outcomes, whether a coin is fair. Fisher called the hypothesis being tested the null hypothesis. Suppose that an experimenter tosses the coin 20 times. One test statistic, a number derived from the test's outcome, is the number of Heads. For a proposed test, one may compute the probability distribution of the number of Heads given the null hypothesis. Suppose that an experimenter sets the significance level at 5%, which means that she decides, in advance of performing the experimental test, to reject the hypothesis given that the test statistic falls into a critical region, containing outcomes of lowest probability given the hypothesis, that has an objective probability of at most 5%. Then, on a test consisting of 20 tosses, the test will lead to rejection of the null hypothesis if the number of Heads is more than 14 or less than 6. Otherwise, it does not lead to rejection of the null hypothesis that the coin is fair. Such a significance test is a cornerstone of classical statistics.
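The critical region just described can be computed directly. This Python sketch builds the two-sided 5% region for 20 tosses by adding outcomes in order of increasing probability under the null hypothesis:

    # Two-sided 5% critical region for the null hypothesis that a coin is fair,
    # with the number of Heads in 20 tosses as the test statistic.
    from math import comb

    n = 20
    pmf = {k: comb(n, k) / 2 ** n for k in range(n + 1)}

    region, total = [], 0.0
    for k in sorted(pmf, key=pmf.get):        # least probable outcomes first
        if total + pmf[k] > 0.05:
            break
        region.append(k)
        total += pmf[k]

    print(sorted(region), round(total, 4))
    # [0, 1, 2, 3, 4, 5, 15, 16, 17, 18, 19, 20] with probability 0.0414:
    # reject when Heads < 6 or Heads > 14, matching the test in the text.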
Classical statisticians are quick to point out that rejection does not justify belief that the hypothesis is false; and nonrejection does not justify belief that the hypothesis is true. For one thing, besides the test result, there may be other relevant evidence, for example, results from other tests. Similarly, an outcome's falling inside the critical region does not justify concluding that the hypothesis is improbable. Nor does its falling outside the critical region justify concluding that the hypothesis is probable. Whether the hypothesis is improbable or probable in light of the test depends on prior probabilities that classical statistics does not entertain because it takes them to be non-objective and illegitimate. Does nonrejection justify willingness to act as if the hypothesis is true? That depends on circumstances. One should not unreservedly act as if the hypothesis is true because it may be false despite test results supporting it. Even a very thorough statistical test with excellent credentials may on occasion fail to reject a false hypothesis. To reduce the risk of acting wrongly, an experimenter may adjust the significance level that triggers rejection in a way that reflects the hypothesis's
practical importance. She may set the significance level at 5% instead of 1% to make rejection more likely if nonrejection will lead to a weighty decision. However, that kind of pragmatism is only a rough guide to action. It is better to use the probability of the hypothesis to guide action, if that probability exists. Having that probability, the experimenter is in a position to apply the traditional decision principle of maximizing expected utility. Bayesians often point out that classical statistics has subjective components, and in that regard is in just as much “trouble” as Bayesian statistics. Suppose that in the example of tossing a coin 20 times, the number of Heads is 5. Given that the coin is fair, that is an improbable result, but so is any other number of Heads. Each possible number of Heads in 20 tosses is improbable given that the coin is fair, although some numbers of Heads are less improbable than are other numbers. For example, although both 10 Heads and 20 Heads are improbable numbers of Heads, 10 Heads is less improbable than 20 Heads because many sequences of tosses produce 10 Heads, whereas only one sequence produces 20 Heads. Rejection of the null hypothesis depends on the probability of a class of events to which a test’s outcome belongs, not just on the outcome’s probability given the hypothesis. A significance test rejects the hypothesis because the test’s outcome and all those possible outcomes with equal or less probability have probabilities that sum to less than some critical value such as 5%. The test rejects the coin’s fairness if and only if the number of Heads falls into the critical region of low probability. However, there are many other regions of equally low probability. Using a particular critical region is a matter of subjective choice. Howson and Urbach [1989: 160] report, “It seemed to Neyman and Pearson that if Fisher were merely interested in seeing whether the outcome of a trial fell in a region of low probability, he could, for example, have chosen for that region a narrow band in the centre of a bell-shaped distribution just as well as a wide band in its tails.” A test statistic for a hypothesis’s statistical test summarizes the possible outcomes that the test may yield. Different test statistics use different ways of describing possible outcomes. A test statistic usually discards much information about a test’s actual outcome. Suppose that the example’s test of a coin yields 6 Heads. That the test produced 6 Heads in 20 tosses is just one description of its outcome. Including the order of Heads and Tails in the 20 tosses yields another description. Statistical tests may use a wide variety of test statistics – many of which may seem quite “unnatural.” A test statistic may, for example, count as a single possible outcome the disjunctive result of obtaining 5 Heads or 10 Heads. It may also count as a single possible outcome the disjunctive result of obtaining 14 or 15 Heads. Using these two disjunctive outcomes to reduce the number of possible outcomes may affect whether the null hypothesis is rejected at the 5% level, as Howson and Urbach [1989: 130–132] explain. So test results (rejection or nonrejection) depend on which test statistic is selected. A test statistic is sufficient just in case it includes all information about possible test outcomes that is relevant to the hypothesis being tested. A minimally sufficient statistic has all the relevant information and no more than is relevant.
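The arbitrariness is easy to exhibit numerically. In this Python sketch (continuing the 20-toss example), the conventional two-tailed region and an 'unnatural' region consisting of exactly 6 Heads both have probability below 5% under the null, so a low-probability criterion alone does not choose between them:

    from math import comb

    n = 20
    def pmf(k):
        return comb(n, k) / 2 ** n

    print(round(pmf(10), 4), pmf(20))   # 0.1762 vs about 0.00000095: both outcomes
                                        # are "improbable", but 10 Heads far less so
    tails = sum(pmf(k) for k in range(6)) + sum(pmf(k) for k in range(15, n + 1))
    oddball = pmf(6)                    # reject only on exactly 6 Heads
    print(round(tails, 4), round(oddball, 4))   # 0.0414 and 0.0370: each region falls
                                                # below 5%, yet they license different
                                                # rejections on the same data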
Some statisticians respond to the dependency of a test's result on the test statistic selected by proposing use of only a minimally sufficient test statistic. However, whether this proposal adequately deals with the problem is controversial. In any case, Bayesians argue that classical statistics does not objectively ground selection of a test statistic and that selection of a test statistic is a subjective choice. The case that classical statistics has subjective components is a strong one. A statistical test's critical region, test statistic, and significance level appear to be settled subjectively. So, arguably, classical statistics has no clear advantage over Bayesian statistics with respect to objectivity. Bayesianism is a robust theory of statistical inference. It has resources for meeting common objections to it. Its elaboration and refinement contribute enormously to the discipline of statistics and to the multitude of disciplines that employ statistical methods.6

6 I thank Kenneth Boyce for bibliographical research for this essay and also thank Prasanta Bandyopadhyay and James Hawthorne for helpful comments on a preliminary draft.
7 APPENDIX: SAVAGE'S REPRESENTATION THEOREM
This appendix presents the axioms of preference on which Savage's [1954/1972] famous representation theorem relies. Savage's representation theorem assumes a set of states S with elements s, s′, . . . and subsets A, B, C, . . . , and also a set of consequences F with elements f, g, h, . . . . For an agent, acts are arbitrary functions f, g, h, . . . from S to F. For acts f and g, the expression f ≤ g means that the agent does not prefer f to g. Savage's theorem derives from postulates P1–P7, relying on definitions D1–D5.

P1 The relation ≤ is a simple ordering.

D1 f ≤ g given B, if and only if f′ ≤ g′ for every f′ and g′ that agree with f and g, respectively, on B and with each other on ∼B and g′ ≤ f′ either for all such pairs or for none.

P2 For every f, g, and B, f ≤ g given B or g ≤ f given B.

D2 g ≤ g′, if and only if f ≤ f′ when f(s) = g, f′(s) = g′ for every s ∈ S.

D3 B is null, if and only if f ≤ g given B for every f, g.

P3 If f(s) = g, f′(s) = g′ for every s ∈ B, and B is not null, then f ≤ f′ given B, if and only if g ≤ g′.

D4 A ≤ B, if and only if fA ≤ fB or g ≤ g′ for every fA, fB, g, g′ such that: fA(s) = g for s ∈ A, fA(s) = g′ for s ∈ ∼A; fB(s) = g for s ∈ B, fB(s) = g′ for s ∈ ∼B.

P4 For every A, B, A ≤ B or B ≤ A.

P5 It is false that, for every f, f′, f ≤ f′.

P6 Suppose it false that g ≤ h; then, for every f, there is a (finite) partition of S such that, if g′ agrees with g and h′ agrees with h except on an arbitrary element of the partition, g′ and h′ being equal to f there, then it will be false that g′ ≤ h or g ≤ h′.

D5 f ≤ g given B (g ≤ f given B), if and only if f ≤ h given B (h ≤ f given B), when h(s) = g for every s.

P7 If f ≤ g(s) given B (g(s) ≤ f given B) for every s ∈ B, then f ≤ g given B (g ≤ f given B).
Using the postulates or axioms of preference, Savage shows that there is a probability function over states and a utility function over consequences (given a choice of scale) such that preferences among acts follow acts' expected utilities. Savage (Sec. 3.3) first shows that a quantitative probability function represents preferences concerning gambles on events, or sets of states. Then he (Sec. 5.4) shows that a utility function (given a choice of scale) represents preferences concerning acts in general, assuming the probability function earlier established and the equality of the act's utility with its expected utility.
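What the representation asserts can be illustrated with a toy computation (not part of Savage's proof; the probability and utility assignments are hypothetical): given the derived functions, acts are ordered by their expected utilities.

    # Acts as functions from states to consequences; preference tracks expected utility.
    prob = {"s1": 0.2, "s2": 0.5, "s3": 0.3}        # probability over states
    util = {"win": 1.0, "draw": 0.4, "lose": 0.0}   # utility over consequences

    def expected_utility(act):
        return sum(prob[s] * util[act(s)] for s in prob)

    f = lambda s: "draw"                            # a constant act
    g = lambda s: "win" if s == "s2" else "lose"    # a gamble on s2

    # In Savage's sense, f <= g just in case EU(f) <= EU(g).
    print(expected_utility(f), expected_utility(g))   # 0.4 vs 0.5: the agent prefers g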
BIBLIOGRAPHY

[Arntzenius, 2003] F. Arntzenius. Some Problems for Conditionalization and Reflection. Journal of Philosophy 100: 356–370, 2003.
[Berger, 1993] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Second edition. New York: Springer, 1993.
[Christensen, 1996] D. Christensen. Dutch Book Arguments Depragmatized: Epistemic Consistency for Partial Believers. Journal of Philosophy 93: 450–479, 1996.
[De Finetti, 1964] B. De Finetti. Foresight: Its Logical Laws, Its Subjective Sources. In H. E. Kyburg, Jr. and H. E. Smokler, eds., Studies in Subjective Probability, pp. 93–158. New York: Wiley, 1964. Originally published in 1937 as La Prévision: Ses Lois Logiques, Ses Sources Subjectives. Annales de l'Institut Henri Poincaré 7: 1–68.
[DeGroot, 1970] M. DeGroot. Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
[Edwards, 1992] A. W. F. Edwards. Likelihood. Expanded edition. Baltimore: Johns Hopkins University Press, 1992.
[Freedman, 1997] D. A. Freedman. From Association to Causation via Regression. In V. R. Vaughn and S. P. Turner, eds., Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences, pp. 113–161. Notre Dame, IN: University of Notre Dame Press, 1997.
[Gregory, 2005] P. C. Gregory. Bayesian Logical Data Analysis for the Physical Sciences. Cambridge: Cambridge University Press, 2005.
[Horwich, 1982] P. Horwich. Probability and Evidence. Cambridge: Cambridge University Press, 1982.
[Howson and Urbach, 1989] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. LaSalle, IL: Open Court, 1989.
[Jaynes, 2003] E. T. Jaynes. Probability Theory—The Logic of Science. G. L. Bretthorst, ed. Cambridge: Cambridge University Press, 2003.
[Jeffrey, 1983] R. Jeffrey. The Logic of Decision. Second edition. Chicago: University of Chicago Press, 1983. First edition, 1965. New York: McGraw-Hill.
[Joyce, 1998] J. M. Joyce. A Nonpragmatic Vindication of Probabilism. Philosophy of Science 65: 575–603, 1998.
[Joyce, 1999] J. M. Joyce. The Foundations of Causal Decision Theory. Cambridge: Cambridge University Press, 1999.
[Kaplan, 1996] M. Kaplan. Decision Theory as Philosophy. Cambridge: Cambridge University Press, 1996.
[Kemeny, 1955] J. Kemeny. Fair Bets and Inductive Probabilities. Journal of Symbolic Logic 20: 263–273, 1955.
[Kennedy and Chihara, 1979] R. Kennedy and C. Chihara. The Dutch Book Argument: Its Logical Flaws, Its Subjective Sources. Philosophical Studies 36: 19–33, 1979.
[Kyburg, 1983] H. Kyburg. Epistemology and Inference. Minneapolis, MN: University of Minnesota Press, 1983.
[Levi, 1967] I. Levi. Gambling with Truth: An Essay on Induction and the Aims of Science. New York: Knopf, 1967.
[Levi, 2006] I. Levi. Replies. In E. Olsson, ed., Knowledge and Inquiry: Essays on the Pragmatism of Isaac Levi, pp. 227–380. Cambridge: Cambridge University Press, 2006.
[Lewis, 1986] D. Lewis. Philosophical Papers. Vol. 2. New York: Oxford University Press, 1986.
[Lindley, 1971] D. V. Lindley. Making Decisions. London: Wiley, 1971.
[Maher, 1993] P. Maher. Betting on Theories. Cambridge: Cambridge University Press, 1993.
[Maher, forthcoming] P. Maher. Bayesian Probability. Synthese, forthcoming.
[Ramsey, 1990] F. P. Ramsey. Truth and Probability. In D. H. Mellor, ed., Philosophical Papers, pp. 52–109. Cambridge: Cambridge University Press, 1990. Originally published in 1931 in R. B. Braithwaite, ed., Foundations of Mathematics and Other Essays, pp. 156–198. London: Routledge & Kegan Paul.
[Rosenkrantz, 1981] R. D. Rosenkrantz. Foundations and Applications of Inductive Probability. Atascadero, CA: Ridgeview Publishing Company, 1981.
[Savage, 1972] L. J. Savage. The Foundations of Statistics. Second edition. New York: Dover, 1972. First edition, 1954. New York: Wiley.
[Shimony, 1988] A. Shimony. An Adamite Derivation of the Calculus of Probability. In J. H. Fetzer, ed., Probability and Causality, pp. 79–89. Dordrecht: Reidel, 1988.
[Skyrms, 1990] B. Skyrms. The Dynamics of Rational Deliberation. Cambridge, MA: Harvard University Press, 1990.
[Teller, 1973] P. Teller. Conditionalization and Observation. Synthese 26: 218–258, 1973.
[van Fraassen, 1983] B. van Fraassen. Calibration: A Frequency Justification for Personal Probability. In R. S. Cohen and L. Laudan, eds., Physics, Philosophy, and Psychoanalysis: Essays in Honor of Adolf Grünbaum, pp. 295–319. Dordrecht: Reidel, 1983.
[Wald, 1950] A. Wald. Statistical Decision Functions. New York: Wiley, 1950.
[Weirich, 2001] P. Weirich. Decision Space: Multidimensional Utility Analysis. Cambridge: Cambridge University Press, 2001.
[Weirich, 2004] P. Weirich. Realistic Decision Theory: Rules for Nonideal Agents in Nonideal Circumstances. New York: Oxford University Press, 2004.
MODERN BAYESIAN INFERENCE: FOUNDATIONS AND OBJECTIVE METHODS

José M. Bernardo

The field of statistics includes two major paradigms: frequentist and Bayesian. Bayesian methods provide a complete paradigm for both statistical inference and decision making under uncertainty. Bayesian methods may be derived from an axiomatic system and provide a coherent methodology which makes it possible to incorporate relevant initial information, and which solves many of the difficulties which frequentist methods are known to face. If no prior information is to be assumed, a situation often met in scientific reporting and public decision making, a formal initial prior function must be mathematically derived from the assumed model. This leads to objective Bayesian methods, objective in the precise sense that their results, like frequentist results, only depend on the assumed model and the data obtained. The Bayesian paradigm is based on an interpretation of probability as a rational conditional measure of uncertainty, which closely matches the sense of the word 'probability' in ordinary language. Statistical inference about a quantity of interest is described as the modification of the uncertainty about its value in the light of evidence, and Bayes' theorem specifies how this modification should precisely be made.
1 INTRODUCTION
Scientific experimental or observational results generally consist of (possibly many) sets of data of the general form D = {x1, . . . , xn}, where the xi's are somewhat "homogeneous" (possibly multidimensional) observations. Statistical methods are then typically used to derive conclusions on both the nature of the process which has produced those observations, and on the expected behaviour at future instances of the same process. A central element of any statistical analysis is the specification of a probability model which is assumed to describe the mechanism which has generated the observed data D as a function of a (possibly multidimensional) parameter (vector) ω ∈ Ω, sometimes referred to as the state of nature, about whose value only limited information (if any) is available. All derived statistical conclusions are obviously conditional on the assumed probability model. Unlike most other branches of mathematics, frequentist methods of statistical inference suffer from the lack of an axiomatic basis; as a consequence, their proposed desiderata are often mutually incompatible, and the analysis of the same data may well lead to incompatible results when different, apparently intuitive
procedures are tried; see Lindley [1972] and Jaynes [1976] for many instructive examples. In marked contrast, the Bayesian approach to statistical inference is firmly based on axiomatic foundations which provide a unifying logical structure, and guarantee the mutual consistency of the methods proposed. Bayesian methods constitute a complete paradigm for statistical inference, a scientific revolution in Kuhn's sense.

Bayesian statistics only requires the mathematics of probability theory and the interpretation of probability which most closely corresponds to the standard use of this word in everyday language: it is no accident that some of the more important seminal books on Bayesian statistics, such as the works of de Laplace [1812], Jeffreys [1939] or de Finetti [1970], are actually entitled “Probability Theory”. The practical consequences of adopting the Bayesian paradigm are far reaching. Indeed, Bayesian methods (i) reduce statistical inference to problems in probability theory, thereby minimizing the need for completely new concepts, and (ii) serve to discriminate among conventional, typically frequentist statistical techniques, by either providing a logical justification for some (and making explicit the conditions under which they are valid), or proving the logical inconsistency of others.

The main result from these foundations is the mathematical need to describe by means of probability distributions all uncertainties present in the problem. In particular, unknown parameters in probability models must have a joint probability distribution which describes the available information about their values; this is often regarded as the characteristic element of a Bayesian approach. Notice that (in sharp contrast to conventional statistics) parameters are treated as random variables within the Bayesian paradigm. This is not a description of their variability (parameters are typically fixed unknown quantities) but a description of the uncertainty about their true values.

A most important particular case arises when either no relevant prior information is readily available, or that information is subjective and an “objective” analysis is desired, one that is exclusively based on accepted model assumptions and well-documented public prior information. This is addressed by reference analysis, which uses information-theoretic concepts to derive formal reference prior functions which, when used in Bayes' theorem, lead to posterior distributions encapsulating inferential conclusions on the quantities of interest solely based on the assumed model and the observed data.

In this article it is assumed that probability distributions may be described through their probability density functions, and no distinction is made between a random quantity and the particular values that it may take. Bold italic roman fonts are used for observable random vectors (typically data) and bold italic greek fonts are used for unobservable random vectors (typically parameters); lower case is used for variables and calligraphic upper case for their domain sets. Moreover, the standard mathematical convention of referring to functions, say f and g of x ∈ X, respectively by f(x) and g(x), will be used throughout. Thus, π(θ|D, C) and p(x|θ, C) respectively represent general probability densities of the unknown parameter θ ∈ Θ given data D and conditions C, and of the observable random
vector x ∈ X conditional on θ and C. Hence, π(θ|D, C) ≥ 0, ∫_Θ π(θ|D, C) dθ = 1, and p(x|θ, C) ≥ 0, ∫_X p(x|θ, C) dx = 1. This admittedly imprecise notation will greatly simplify the exposition. If the random vectors are discrete, these functions naturally become probability mass functions, and integrals over their values become sums. Density functions of specific distributions are denoted by appropriate names. Thus, if x is a random quantity with a normal distribution of mean µ and standard deviation σ, its probability density function will be denoted N(x|µ, σ).

Bayesian methods make frequent use of the concept of logarithmic divergence, a very general measure of the goodness of the approximation of a probability density p(x) by another density p̂(x). The Kullback-Leibler, or logarithmic, divergence of a probability density p̂(x) of the random vector x ∈ X from its true probability density p(x) is defined as κ{p̂(x)|p(x)} = ∫_X p(x) log{p(x)/p̂(x)} dx. It may be shown that (i) the logarithmic divergence is non-negative (and it is zero if, and only if, p̂(x) = p(x) almost everywhere), and (ii) that κ{p̂(x)|p(x)} is invariant under one-to-one transformations of x.

This article contains a brief summary of the mathematical foundations of Bayesian statistical methods (Section 2), an overview of the paradigm (Section 3), a detailed discussion of objective Bayesian methods (Section 4), and a description of useful objective inference summaries, including estimation and hypothesis testing (Section 5). Good introductions to objective Bayesian statistics include Lindley [1965], Zellner [1971], and Box and Tiao [1973]. For more advanced monographs, see [Berger, 1985; Bernardo and Smith, 1994].
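The logarithmic divergence just defined is straightforward to evaluate numerically, and doing so makes its properties easy to check. The following sketch (Python; the two normal densities, the grid, and the helper name kl_divergence are illustrative choices, not part of the chapter) approximates κ{p̂(x)|p(x)} by quadrature on a grid:

```python
import numpy as np
from scipy.stats import norm

def kl_divergence(p, p_hat, grid):
    """Approximate k{p_hat | p} = integral of p(x) log[p(x)/p_hat(x)] dx."""
    px, qx = p(grid), p_hat(grid)
    return np.trapz(px * np.log(px / qx), grid)

x = np.linspace(-12.0, 12.0, 8001)
p = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # "true" density p(x)
q = lambda x: norm.pdf(x, loc=0.5, scale=1.2)   # approximating density p_hat(x)

print(kl_divergence(p, q, x))   # strictly positive, since p_hat differs from p
print(kl_divergence(p, p, x))   # zero (up to quadrature error) when p_hat = p
```

Applying any one-to-one transformation to x (and transforming both densities accordingly) would leave the first value unchanged, in line with property (ii).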
2 FOUNDATIONS
A central element of the Bayesian paradigm is the use of probability distributions to describe all relevant unknown quantities, interpreting the probability of an event as a conditional measure of uncertainty, on a [0, 1] scale, about the occurrence of the event in some specific conditions. The limiting extreme values 0 and 1, which are typically inaccessible in applications, respectively describe impossibility and certainty of the occurrence of the event. This interpretation of probability includes and extends all other probability interpretations. There are two independent arguments which prove the mathematical inevitability of the use of probability distributions to describe uncertainties; these are summarized later in this section.
2.1 Probability as a Rational Measure of Conditional Uncertainty
Bayesian statistics uses the word probability in precisely the same sense in which this word is used in everyday language, as a conditional measure of uncertainty associated with the occurrence of a particular event, given the available information
and the accepted assumptions. Thus, Pr(E|C) is a measure of (presumably rational) belief in the occurrence of the event E under conditions C. It is important to stress that probability is always a function of two arguments, the event E whose uncertainty is being measured, and the conditions C under which the measurement takes place; “absolute” probabilities do not exist. In typical applications, one is interested in the probability of some event E given the available data D, the set of assumptions A which one is prepared to make about the mechanism which has generated the data, and the relevant contextual knowledge K which might be available. Thus, Pr(E|D, A, K) is to be interpreted as a measure of (presumably rational) belief in the occurrence of the event E, given data D, assumptions A and any other available knowledge K, as a measure of how “likely” the occurrence of E is in these conditions. Sometimes, but certainly not always, the probability of an event under given conditions may be associated with the relative frequency of “similar” events in “similar” conditions. The following examples are intended to illustrate the use of probability as a conditional measure of uncertainty.

Probabilistic diagnosis. A human population is known to contain 0.2% of people infected by a particular virus. A person, randomly selected from that population, is subject to a test which, from laboratory data, is known to yield positive results in 98% of infected people and in 1% of non-infected people, so that, if V denotes the event that a person carries the virus, V̄ its complement, and + denotes a positive result, Pr(+|V) = 0.98 and Pr(+|V̄) = 0.01. Suppose that the result of the test turns out to be positive. Clearly, one is then interested in Pr(V|+, A, K), the probability that the person carries the virus, given the positive result, the assumptions A about the probability mechanism generating the test results, and the available knowledge K of the prevalence of the infection in the population under study (described here by Pr(V|K) = 0.002). An elementary exercise in probability algebra, which involves Bayes' theorem in its simplest form (see Section 3), yields Pr(V|+, A, K) = 0.164. Notice that the four probabilities involved in the problem have the same interpretation: they are all conditional measures of uncertainty. Besides, Pr(V|+, A, K) is both a measure of the uncertainty associated with the event that the particular person who tested positive is actually infected, and an estimate of the proportion of people in that population (about 16.4%) that would eventually prove to be infected among those who yielded a positive test. ⊳

Estimation of a proportion. A survey is conducted to estimate the proportion θ of individuals in a population who share a given property. A random sample of n elements is analyzed, r of which are found to possess that property. One is then typically interested in using the results from the sample to establish regions of [0, 1] where the unknown value of θ may plausibly be expected to lie; this information is provided by probabilities of the form Pr(a < θ < b|r, n, A, K), a conditional measure of the uncertainty about the event that θ belongs to (a, b) given the information provided by the data (r, n), the assumptions A made on the behaviour of the mechanism which has generated the data (a random sample of n Bernoulli
trials), and any relevant knowledge K on the values of θ which might be available. For example, after a political survey in which 720 citizens out of a random sample of 1500 have declared their support for a particular political measure, one may conclude that Pr(θ < 0.5|720, 1500, A, K) = 0.933, indicating a probability of about 93% that a referendum on that issue would be lost. Similarly, after a screening test for an infection where 100 people have been tested, none of whom has turned out to be infected, one may conclude that Pr(θ < 0.01|0, 100, A, K) = 0.844, or a probability of about 84% that the proportion of infected people is smaller than 1%. ⊳

Measurement of a physical constant. A team of scientists, intending to establish the unknown value of a physical constant µ, obtain data D = {x1, . . . , xn} which are considered to be measurements of µ subject to error. The probabilities of interest are then typically of the form Pr(a < µ < b|x1, . . . , xn, A, K), the probability that the unknown value of µ (fixed in nature, but unknown to the scientists) lies within an interval (a, b) given the information provided by the data D, the assumptions A made on the behaviour of the measurement mechanism, and whatever knowledge K might be available on the value of the constant µ. Again, those probabilities are conditional measures of uncertainty which describe the (necessarily probabilistic) conclusions of the scientists on the true value of µ, given available information and accepted assumptions. For example, after a classroom experiment to measure the gravitational field with a pendulum, a student may report (in m/sec²) something like Pr(9.788 < g < 9.829|D, A, K) = 0.95, meaning that, under accepted knowledge K and assumptions A, the observed data D indicate that the true value of g lies within 9.788 and 9.829 with probability 0.95, a conditional uncertainty measure on a [0,1] scale. This is naturally compatible with the fact that the value of the gravitational field at the laboratory may well be known with high precision from available literature or from precise previous experiments, but the student may have been instructed not to use that information as part of the accepted knowledge K. Under some conditions, it is also true that if the same procedure were actually used by many other students with similarly obtained data sets, their reported intervals would actually cover the true value of g in approximately 95% of the cases, thus providing a frequentist calibration of the student's probability statement. ⊳

Prediction. An experiment is made to count the number r of times that an event E takes place in each of n replications of a well defined situation; it is observed that E does take place ri times in replication i, and it is desired to forecast the number of times r that E will take place in a similar future situation. This is a prediction problem on the value of an observable (discrete) quantity r, given the information provided by data D, accepted assumptions A on the probability mechanism which generates the ri's, and any relevant available knowledge K. Computation of the probabilities {Pr(r|r1, . . . , rn, A, K)}, for r = 0, 1, . . ., is thus
required. For example, the quality assurance engineer of a firm which produces automobile restraint systems may report something like Pr(r = 0|r1 = . . . = r10 = 0, A, K) = 0.953, after observing that the entire production of airbags in each of n = 10 consecutive months has yielded no complaints from their clients. This should be regarded as a measure, on a [0, 1] scale, of the conditional uncertainty, given observed data, accepted assumptions and contextual knowledge, associated with the event that no airbag complaint will come from next month's production and, if conditions remain constant, this is also an estimate of the proportion of months expected to share this desirable property.

A similar problem may naturally be posed with continuous observables. For instance, after measuring some continuous magnitude in each of n randomly chosen elements within a population, it may be desired to forecast the proportion of items in the whole population whose magnitude satisfies some precise specifications. As an example, after measuring the breaking strengths {x1, . . . , x10} of 10 randomly chosen safety belt webbings to verify whether or not they satisfy the requirement of remaining above 26 kN, the quality assurance engineer may report something like Pr(x > 26|x1, . . . , x10, A, K) = 0.9987. This should be regarded as a measure, on a [0, 1] scale, of the conditional uncertainty (given observed data, accepted assumptions and contextual knowledge) associated with the event that a randomly chosen safety belt webbing will support no less than 26 kN. If production conditions remain constant, it will also be an estimate of the proportion of safety belts which will conform to this particular specification.

Often, additional information about future observations is provided by related covariates. For instance, after observing the outputs {y1, . . . , yn} which correspond to a sequence {x1, . . . , xn} of different production conditions, it may be desired to forecast the output y which would correspond to a particular set x of production conditions. For instance, the viscosity of commercial condensed milk is required to be within specified values a and b; after measuring the viscosities {y1, . . . , yn} which correspond to samples of condensed milk produced under different physical conditions {x1, . . . , xn}, production engineers will require probabilities of the form Pr(a < y < b|x, (y1, x1), . . . , (yn, xn), A, K). This is a conditional measure of the uncertainty (always given observed data, accepted assumptions and contextual knowledge) associated with the event that condensed milk produced under conditions x will actually satisfy the required viscosity specifications. ⊳
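Probabilities of the kind quoted in these examples are easy to compute once a prior has been fixed. The sketch below (Python, assuming scipy is available) reproduces the two proportion figures above; the Be(θ|50, 50) prior for the referendum example is the one adopted in Example 2 of Section 3, while the Jeffreys Be(θ|1/2, 1/2) prior for the screening example is an assumption made here — one conventional choice, of the type derived formally in Section 4 — since the text quotes the numbers but not the priors behind them.

```python
from scipy.stats import beta

# Referendum: Be(50, 50) prior, r = 720 positive answers in n = 1500.
# Conjugate posterior: Be(50 + 720, 50 + 780) = Be(770, 830).
print(beta.cdf(0.5, 770, 830))      # Pr(theta < 0.5 | r, n) ~ 0.933

# Screening: Jeffreys Be(1/2, 1/2) prior, r = 0 infected in n = 100.
# Conjugate posterior: Be(0.5, 100.5).
print(beta.cdf(0.01, 0.5, 100.5))   # Pr(theta < 0.01 | r, n) ~ 0.844
```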
2.2 Statistical Inference and Decision Theory
Decision theory not only provides a precise methodology to deal with decision problems under uncertainty, but its solid axiomatic basis also provides a powerful reinforcement to the logical force of the Bayesian approach. We now summarize the basic argument. A decision problem exists whenever there are two or more possible courses of action; let A be the class of possible actions. Moreover, for each a ∈ A, let Θa be the set of relevant events which may affect the result of choosing a, and let
c(a, θ) ∈ Ca, θ ∈ Θa, be the consequence of having chosen action a when event θ takes place. The class of pairs {(Θa, Ca), a ∈ A} describes the structure of the decision problem. Without loss of generality, it may be assumed that the possible actions are mutually exclusive, for otherwise one would work with the appropriate Cartesian product.

Different sets of principles have been proposed to capture a minimum collection of logical rules that could sensibly be required for “rational” decision-making. These all consist of axioms with a strong intuitive appeal; examples include the transitivity of preferences (if a1 > a2 given C, and a2 > a3 given C, then a1 > a3 given C), and the sure-thing principle (if a1 > a2 given C and E, and a1 > a2 given C and not E, then a1 > a2 given C). Notice that these rules are not intended as a description of actual human decision-making, but as a normative set of principles to be followed by someone who aspires to achieve coherent decision-making. There are naturally different options for the set of acceptable principles (see e.g. [Ramsey, 1926; Savage, 1954; DeGroot, 1970; Bernardo and Smith, 1994, Ch. 2] and references therein), but all of them lead basically to the same conclusions, namely:

(i) Preferences among consequences should be measured with a real-valued bounded utility function U(c) = U(a, θ) which specifies, on some numerical scale, their desirability.

(ii) The uncertainty of relevant events should be measured with a set of probability distributions {(π(θ|C, a), θ ∈ Θa), a ∈ A} describing their plausibility given the conditions C under which the decision must be taken.

(iii) The desirability of the available actions is measured by their corresponding expected utility

(1) U(a|C) = ∫_{Θa} U(a, θ) π(θ|C, a) dθ,  a ∈ A.

It is often convenient to work in terms of the non-negative loss function defined by

(2) L(a, θ) = sup_{a′∈A} U(a′, θ) − U(a, θ),

which directly measures, as a function of θ, the “penalty” for choosing a wrong action. The relative undesirability of available actions a ∈ A is then measured by their expected loss

(3) L(a|C) = ∫_{Θa} L(a, θ) π(θ|C, a) dθ,  a ∈ A.
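With a finite set of actions and a discretized parameter space, the integrals (1) and (3) become weighted sums, and the best action can be found by direct enumeration. A minimal sketch (Python; the utility table and the probabilities are invented purely for illustration):

```python
import numpy as np

# Rows: three possible actions; columns: two relevant events theta_1, theta_2.
U = np.array([[10.0, 0.0],     # U(a1, theta1), U(a1, theta2)
              [ 6.0, 5.0],
              [ 2.0, 8.0]])
prob = np.array([0.3, 0.7])    # pi(theta | C), here the same for every action

expected_utility = U @ prob    # equation (1) as a finite sum
loss = U.max(axis=0) - U       # equation (2): sup over actions, event by event
expected_loss = loss @ prob    # equation (3)

print(expected_utility)        # [3.0, 5.3, 6.2]
print(expected_loss)           # [5.6, 3.3, 2.4]
print(np.argmax(expected_utility) == np.argmin(expected_loss))  # True: same action
```

As the last line checks, maximizing expected utility and minimizing expected loss single out the same action, since the two criteria differ only by a term which does not depend on a.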
Notice that, in particular, the argument described above establishes the need to quantify the uncertainty about all relevant unknown quantities (the actual values of the θ’s), and specifies that this quantification must have the mathematical structure of probability distributions. These probabilities are conditional on the
circumstances C under which the decision is to be taken, which typically, but not necessarily, include the results D of some relevant experimental or observational data. It has been argued that the development described above (which is not questioned when decisions have to be made) does not apply to problems of statistical inference, where no specific decision making is envisaged. However, there are two powerful counterarguments to this. Indeed, (i) a problem of statistical inference is typically considered worth analyzing because it may eventually help make sensible decisions; a lump of arsenic is poisonous because it may kill someone, not because it has actually killed someone [Ramsey, 1926], and (ii) it has been shown [Bernardo, 1979a] that statistical inference on θ actually has the mathematical structure of a decision problem, where the class of alternatives is the functional space

(4) A = { π(θ|D); π(θ|D) > 0, ∫_Θ π(θ|D) dθ = 1 }
of the conditional probability distributions of θ given the data, and the utility function is a measure of the amount of information about θ which the data may be expected to provide.
2.3 Exchangeability and Representation Theorem
Available data often take the form of a set {x1, . . . , xn} of “homogeneous” (possibly multidimensional) observations, in the precise sense that only their values matter and not the order in which they appear. Formally, this is captured by the notion of exchangeability. The set of random vectors {x1, . . . , xn} is exchangeable if their joint distribution is invariant under permutations. An infinite sequence {xj} of random vectors is exchangeable if all its finite subsequences are exchangeable. Notice that, in particular, any random sample from any model is exchangeable in this sense. The concept of exchangeability, introduced by de Finetti [1937], is central to modern statistical thinking. Indeed, the general representation theorem implies that if a set of observations is assumed to be a subset of an exchangeable sequence, then it constitutes a random sample from some probability model {p(x|ω), ω ∈ Ω}, x ∈ X, described in terms of (labeled by) some parameter vector ω; furthermore this parameter ω is defined as the limit (as n → ∞) of some function of the observations. Available information about the value of ω in prevailing conditions C is necessarily described by some probability distribution π(ω|C).

For example, in the case of a sequence {x1, x2, . . .} of dichotomous exchangeable random quantities xj ∈ {0, 1}, de Finetti's representation theorem establishes that the joint distribution of (x1, . . . , xn) has an integral representation of the form

(5) p(x1, . . . , xn|C) = ∫₀¹ ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi} π(θ|C) dθ,  θ = lim_{n→∞} r/n,
where r = Σ_j xj is the number of positive trials. This is nothing but the joint distribution of a set of (conditionally) independent Bernoulli trials with parameter θ, over which some probability distribution π(θ|C) is therefore proven to exist. More generally, for sequences of arbitrary random quantities {x1, x2, . . .}, exchangeability leads to integral representations of the form

(6) p(x1, . . . , xn|C) = ∫_Ω ∏_{i=1}^n p(xi|ω) π(ω|C) dω,
where {p(x|ω), ω ∈ Ω} denotes some probability model, ω is the limit as n → ∞ of some function f(x1, . . . , xn) of the observations, and π(ω|C) is some probability distribution over Ω. This formulation includes “nonparametric” (distribution free) modelling, where ω may index, for instance, all continuous probability distributions on X. Notice that π(ω|C) does not describe a possible variability of ω (since ω will typically be a fixed unknown vector), but a description of the uncertainty associated with its actual value.

Under appropriate conditioning, exchangeability is a very general assumption, a powerful extension of the traditional concept of a random sample. Indeed, many statistical analyses directly assume data (or subsets of the data) to be a random sample of conditionally independent observations from some probability model, so that p(x1, . . . , xn|ω) = ∏_{i=1}^n p(xi|ω); but any random sample is exchangeable, since ∏_{i=1}^n p(xi|ω) is obviously invariant under permutations. Notice that the observations in a random sample are only independent conditional on the parameter value ω; as nicely put by Lindley, the mantra that the observations {x1, . . . , xn} in a random sample are independent is ridiculous when they are used to infer xn+1. Notice also that, under exchangeability, the general representation theorem provides an existence theorem for a probability distribution π(ω|C) on the parameter space Ω, and that this is an argument which only depends on mathematical probability theory.

Another important consequence of exchangeability is that it provides a formal definition of the parameter ω which labels the model as the limit, as n → ∞, of some function f(x1, . . . , xn) of the observations; the function f obviously depends both on the assumed model and the chosen parametrization. For instance, in the case of a sequence of Bernoulli trials, the parameter θ is defined as the limit, as n → ∞, of the relative frequency r/n. It follows that, under exchangeability, the sentence “the true value of ω” has a well-defined meaning, if only asymptotically verifiable. Moreover, if two different models have parameters which are functionally related by their definition, then the corresponding posterior distributions may be meaningfully compared, for they refer to functionally related quantities. For instance, if a finite subset {x1, . . . , xn} of an exchangeable sequence of integer observations is assumed to be a random sample from a Poisson distribution Po(x|λ), so that E[x|λ] = λ, then λ is defined as lim_{n→∞} x̄n, where x̄n = Σ_j xj/n; similarly, if for some fixed non-zero integer r, the same data are assumed to be a random sample from a negative binomial Nb(x|r, θ), so that E[x|θ, r] = r(1 − θ)/θ, then θ is defined as lim_{n→∞} {r/(x̄n + r)}. It follows that θ ≡ r/(λ + r) and, hence,
θ and r/(λ + r) may be treated as the same (unknown) quantity whenever this might be needed, as for example when comparing the relative merits of these alternative probability models.
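The representation (5) is easy to probe by simulation. The sketch below (Python; the Be(θ|2, 3) mixing distribution is an arbitrary illustrative choice) estimates the mixture probability of two binary strings with the same number of positive trials, and checks that, as exchangeability requires, only r and n matter, not the order of the observations:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 3.0, size=1_000_000)  # draws from an illustrative pi(theta|C)

def mixture_prob(x, theta):
    """Monte Carlo estimate of equation (5) for a binary tuple x."""
    r, n = sum(x), len(x)
    return np.mean(theta**r * (1.0 - theta)**(n - r))

# Same r = 2 and n = 5, different orderings: identical joint probabilities.
print(mixture_prob((1, 1, 0, 0, 0), theta))
print(mixture_prob((0, 1, 0, 1, 0), theta))
```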
3 THE BAYESIAN PARADIGM
The statistical analysis of some observed data D typically begins with some informal descriptive evaluation, which is used to suggest a tentative, formal probability model {p(D|ω), ω ∈ Ω} assumed to represent, for some (unknown) value of ω, the probabilistic mechanism which has generated the observed data D. The arguments outlined in Section 2 establish the logical need to assess a prior probability distribution π(ω|K) over the parameter space Ω, describing the available knowledge K about the value of ω prior to the data being observed. It then follows from standard probability theory that, if the probability model is correct, all available information about the value of ω after the data D have been observed is contained in the corresponding posterior distribution whose probability density, π(ω|D, A, K), is immediately obtained from Bayes' theorem,

(7) π(ω|D, A, K) = p(D|ω) π(ω|K) / ∫_Ω p(D|ω) π(ω|K) dω,
where A stands for the assumptions made on the probability model. It is this systematic use of Bayes' theorem to incorporate the information provided by the data that justifies the adjective Bayesian by which the paradigm is usually known. It is obvious from Bayes' theorem that any value of ω with zero prior density will have zero posterior density. Thus, it is typically assumed (by appropriate restriction, if necessary, of the parameter space Ω) that prior distributions are strictly positive (as Savage put it, keep the mind open, or at least ajar). To simplify the presentation, the accepted assumptions A and the available knowledge K are often omitted from the notation, but the fact that all statements about ω given D are also conditional on A and K should always be kept in mind.

EXAMPLE 1 Bayesian inference with a finite parameter space. Let p(D|θ), θ ∈ {θ1, . . . , θm}, be the probability mechanism which is assumed to have generated the observed data D, so that θ may only take a finite number of values. Using the finite form of Bayes' theorem, and omitting the prevailing conditions from the notation, the posterior probability of θi after data D have been observed is

(8) Pr(θi|D) = p(D|θi) Pr(θi) / Σ_{j=1}^m p(D|θj) Pr(θj),  i = 1, . . . , m.
For any prior distribution π(θ) = {Pr(θ1), . . . , Pr(θm)} describing available knowledge about the value of θ, Pr(θi|D) measures how likely θi should be judged to be, given both the initial knowledge described by the prior distribution and the information provided by the data D.
Figure 1. Posterior probability of infection Pr(V|+) given a positive test, as a function of the prior probability of infection Pr(V).

An important, frequent application of this simple technique is provided by probabilistic diagnosis. For example, consider the simple situation where a particular test designed to detect a virus is known from laboratory research to give a positive result in 98% of infected people and in 1% of non-infected. Then, the posterior probability that a person who tested positive is infected is given by Pr(V|+) = (0.98 p)/{0.98 p + 0.01 (1 − p)} as a function of p = Pr(V), the prior probability of a person being infected (the prevalence of the infection in the population under study). Figure 1 shows Pr(V|+) as a function of Pr(V). As one would expect, the posterior probability is only zero if the prior probability is zero (so that it is known that the population is free of infection) and it is only one if the prior probability is one (so that it is known that the population is universally infected). Notice that if the infection is rare, then the posterior probability of a randomly chosen person being infected will be relatively low even if the test is positive. Indeed, for say Pr(V) = 0.002, one finds Pr(V|+) = 0.164, so that in a population where only 0.2% of individuals are infected, only 16.4% of those testing positive within a random sample will actually prove to be infected: most positives would actually be false positives.

In this section, we describe in some detail the learning process described by Bayes' theorem, discuss its implementation in the presence of nuisance parameters, show how it can be used to forecast the value of future observations, and analyze its large sample behaviour.
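Before doing so, the diagnosis computation above is worth a one-line check (Python, standard library only; the function name is ours):

```python
def prob_infected(p, sens=0.98, fpr=0.01):
    """Pr(V | +) by Bayes' theorem, as a function of the prevalence p = Pr(V)."""
    return sens * p / (sens * p + fpr * (1.0 - p))

print(prob_infected(0.002))   # 0.164..., as in the text
print(prob_infected(0.0))     # 0.0: the endpoint behaviour shown in Figure 1
print(prob_infected(1.0))     # 1.0: likewise at the other extreme
```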
3.1 The Learning Process
In the Bayesian paradigm, the process of learning from the data is systematically implemented by making use of Bayes’ theorem to combine the available prior
information with the information provided by the data to produce the required posterior distribution. Computation of posterior densities is often facilitated by noting that Bayes' theorem may be simply expressed as

(9) π(ω|D) ∝ p(D|ω) π(ω)

(where ∝ stands for ‘proportional to’ and where, for simplicity, the accepted assumptions A and the available knowledge K have been omitted from the notation), since the missing proportionality constant [∫_Ω p(D|ω) π(ω) dω]⁻¹ may always be deduced from the fact that π(ω|D), a probability density, must integrate to one. Hence, to identify the form of a posterior distribution it suffices to identify a kernel of the corresponding probability density, that is a function k(ω) such that π(ω|D) = c(D) k(ω) for some c(D) which does not involve ω. In the examples which follow, this technique will often be used.

An improper prior function is defined as a positive function π(ω) such that ∫_Ω π(ω) dω is not finite. Equation (9), the formal expression of Bayes' theorem, remains technically valid if π(ω) is an improper prior function provided that ∫_Ω p(D|ω) π(ω) dω < ∞, thus leading to a well defined proper posterior density π(ω|D) ∝ p(D|ω) π(ω). In particular, as will later be justified (Section 4), it also remains philosophically valid if π(ω) is an appropriately chosen reference (typically improper) prior function.

Considered as a function of ω, l(ω, D) = p(D|ω) is often referred to as the likelihood function. Thus, Bayes' theorem is simply expressed in words by the statement that the posterior is proportional to the likelihood times the prior. It follows from equation (9) that, provided the same prior π(ω) is used, two different data sets D1 and D2, with possibly different probability models p1(D1|ω) and p2(D2|ω) but yielding proportional likelihood functions, will produce identical posterior distributions for ω. This immediate consequence of Bayes' theorem has been proposed as a principle on its own, the likelihood principle, and it is seen by many as an obvious requirement for reasonable statistical inference. In particular, for any given prior π(ω), the posterior distribution does not depend on the set of possible data values, or the sample space. Notice, however, that the likelihood principle only applies to inferences about the parameter vector ω once the data have been obtained. Consideration of the sample space is essential, for instance, in model criticism, in the design of experiments, in the derivation of predictive distributions, and in the construction of objective Bayesian procedures.

Naturally, the terms prior and posterior are only relative to a particular set of data. As one would expect from the coherence induced by probability theory, if data D = {x1, . . . , xn} are sequentially presented, the final result will be the same whether data are globally or sequentially processed. Indeed, π(ω|x1, . . . , xi+1) ∝ p(xi+1|ω) π(ω|x1, . . . , xi), for i = 1, . . . , n − 1, so that the “posterior” at a given stage becomes the “prior” at the next. In most situations, the posterior distribution is “sharper” than the prior so that, in most cases, the density π(ω|x1, . . . , xi+1) will be more concentrated around the true value of ω than π(ω|x1, . . . , xi). However, this is not always the case:
Occasionally, a “surprising” observation will increase, rather than decrease, the uncertainty about the value of ω. For instance, in probabilistic diagnosis, a sharp posterior probability distribution (over the possible causes {ω1, . . . , ωk} of a syndrome) describing a “clear” diagnosis of disease ωi (that is, a posterior with a large probability for ωi) would typically update to a less concentrated posterior probability distribution over {ω1, . . . , ωk} if a new clinical analysis yielded data which were unlikely under ωi.

For a given probability model, one may find that a particular function of the data t = t(D) is a sufficient statistic in the sense that, given the model, t(D) contains all information about ω which is available in D. Formally, t = t(D) is sufficient if (and only if) there exist nonnegative functions f and g such that the likelihood function may be factorized in the form p(D|ω) = f(ω, t) g(D). A sufficient statistic always exists, for t(D) = D is obviously sufficient; however, a much simpler sufficient statistic, with a fixed dimensionality which is independent of the sample size, often exists. In fact this is known to be the case whenever the probability model belongs to the generalized exponential family, which includes many of the more frequently used probability models. It is easily established that if t is sufficient, the posterior distribution of ω only depends on the data D through t(D), and may be directly computed in terms of p(t|ω), so that π(ω|D) = π(ω|t) ∝ p(t|ω) π(ω).

Naturally, for fixed data and model assumptions, different priors lead to different posteriors. Indeed, Bayes' theorem may be described as a data-driven probability transformation machine which maps prior distributions (describing prior knowledge) into posterior distributions (representing combined prior and data knowledge). It is important to analyze whether or not sensible changes in the prior would induce noticeable changes in the posterior. Posterior distributions based on reference “noninformative” priors play a central role in this sensitivity analysis context. Investigation of the sensitivity of the posterior to changes in the prior is an important ingredient of the comprehensive analysis of the sensitivity of the final results to all accepted assumptions which any responsible statistical study should contain.

EXAMPLE 2 Inference on a binomial parameter. If the data D consist of n Bernoulli observations with parameter θ which contain r positive trials, then p(D|θ, n) = θ^r (1 − θ)^{n−r}, so that t(D) = {r, n} is sufficient. Suppose that prior knowledge about θ is described by a Beta distribution Be(θ|α, β), so that π(θ|α, β) ∝ θ^{α−1} (1 − θ)^{β−1}. Using Bayes' theorem, the posterior density of θ is π(θ|r, n, α, β) ∝ θ^r (1 − θ)^{n−r} θ^{α−1} (1 − θ)^{β−1} ∝ θ^{r+α−1} (1 − θ)^{n−r+β−1}, the Beta distribution Be(θ|r + α, n − r + β).

Suppose, for example, that in the light of precedent surveys, available information on the proportion θ of citizens who would vote for a particular political measure in a referendum is described by a Beta distribution Be(θ|50, 50), so that it is judged equally likely that the referendum would be won or lost, and it is judged that the probability that either side wins less than 60% of the vote is 0.95.
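Both the conjugate Be(θ|r + α, n − r + β) update and the coherence of sequential processing noted earlier can be verified in a few lines (Python; the simulated Bernoulli stream is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.48, size=1500)   # illustrative stream of Bernoulli trials
alpha, beta_ = 50.0, 50.0                 # the Be(50, 50) prior described above

# Sequential processing: the posterior at each stage is the prior at the next.
a, b = alpha, beta_
for x in data:
    a, b = a + x, b + (1 - x)

# Global processing: a single update via the sufficient statistic t(D) = (r, n).
r, n = data.sum(), data.size
print((a, b) == (alpha + r, beta_ + n - r))   # True: identical posteriors
```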
Figure 2. Prior and posterior densities of the proportion θ of citizens that would vote in favour of a referendum.
A random survey of size 1500 is then conducted, where only 720 citizens declare to be in favour of the proposed measure. Using the results above, the corresponding posterior distribution is then Be(θ|770, 830). These prior and posterior densities are plotted in Figure 2; it may be appreciated that, as one would expect, the effect of the data is to drastically reduce the initial uncertainty on the value of θ and, hence, on the referendum outcome. More precisely, Pr(θ < 0.5|720, 1500, A, K) = 0.933 (shaded region in Figure 2) so that, after the information from the survey has been included, the probability that the referendum will be lost should be judged to be about 93%.

The general situation where the vector of interest is not the whole parameter vector ω, but some function θ = θ(ω) of possibly lower dimension than ω, will now be considered. Let D be some observed data, let {p(D|ω), ω ∈ Ω} be a probability model assumed to describe the probability mechanism which has generated D, let π(ω) be a probability distribution describing any available information on the value of ω, and let θ = θ(ω) ∈ Θ be a function of the original parameters over whose value inferences based on the data D are required. Any valid conclusion on the value of the vector of interest θ will then be contained in its posterior probability distribution π(θ|D), which is conditional on the observed data D and will naturally also depend, although not explicitly shown in the notation, on the assumed model {p(D|ω), ω ∈ Ω} and on the available prior information encapsulated by π(ω). The required posterior distribution π(θ|D) is found by standard use of probability calculus. Indeed, by Bayes' theorem, π(ω|D) ∝ p(D|ω) π(ω). Moreover, let λ = λ(ω) ∈ Λ be some other function of the original parameters such that ψ = {θ, λ} is a one-to-one transformation of ω, and let J(ω) = (∂ψ/∂ω) be the corresponding Jacobian matrix. Naturally, the introduction of λ is not necessary if θ(ω) is a one-to-one transformation of ω. Using standard change-of-variable probability
techniques, the posterior density of ψ is

(10) π(ψ|D) = π(θ, λ|D) = [ π(ω|D) / |J(ω)| ]_{ω=ω(ψ)}

and the required posterior of θ is the appropriate marginal density, obtained by integration over the nuisance parameter λ,

(11) π(θ|D) = ∫_Λ π(θ, λ|D) dλ.
Notice that elimination of unwanted nuisance parameters, a simple integration within the Bayesian paradigm, is however a difficult (often polemic) problem for frequentist statistics. Sometimes, the range of possible values of ω is effectively restricted by contextual considerations. If ω is known to belong to Ωc ⊂ Ω, the prior distribution is only positive in Ωc and, using Bayes' theorem, it is immediately found that the restricted posterior is

(12) π(ω|D, ω ∈ Ωc) = π(ω|D) / ∫_{Ωc} π(ω|D) dω,  ω ∈ Ωc,
and obviously vanishes if ω ∉ Ωc. Thus, to incorporate a restriction on the possible values of the parameters, it suffices to renormalize the unrestricted posterior distribution to the set Ωc ⊂ Ω of parameter values which satisfy the required condition. Incorporation of known constraints on the parameter values, a simple renormalization within the Bayesian paradigm, is another very difficult problem for conventional statistics. For further details on the elimination of nuisance parameters see [Liseo, 2005].
EXAMPLE 3 Inference on normal parameters. Let D = {x1, . . . , xn} be a random sample from a normal distribution N(x|µ, σ). The corresponding likelihood function is immediately found to be proportional to σ^{−n} exp[−n{s² + (x̄ − µ)²}/(2σ²)], with nx̄ = Σ_i xi and ns² = Σ_i (xi − x̄)². It may be shown (see Section 4) that absence of initial information on the value of both µ and σ may formally be described by a joint prior function which is uniform in both µ and log(σ), that is, by the (improper) prior function π(µ, σ) = σ⁻¹. Using Bayes' theorem, the corresponding joint posterior is

(13) π(µ, σ|D) ∝ σ^{−(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)].

Thus, using the Gamma integral in terms of λ = σ⁻² to integrate out σ,

(14) π(µ|D) ∝ ∫₀^∞ σ^{−(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)] dσ ∝ [s² + (x̄ − µ)²]^{−n/2},

which is recognized as a kernel of the Student density St(µ|x̄, s/√(n−1), n−1). Similarly, integrating out µ,
Figure 3. Posterior density π(g|m, s, n) of the value g of the gravitational field, given n = 20 normal measurements with mean m = 9.8087 and standard deviation s = 0.0428, (a) with no additional information, and (b) with g restricted to Gc = {g; 9.7803 < g < 9.8322}. Shaded areas represent 95%-credible regions of g.

(15) π(σ|D) ∝ ∫_{−∞}^{∞} σ^{−(n+1)} exp[−n{s² + (x̄ − µ)²}/(2σ²)] dµ ∝ σ^{−n} exp[−ns²/(2σ²)].
Changing variables to the precision λ = σ⁻² results in π(λ|D) ∝ λ^{(n−3)/2} e^{−ns²λ/2}, a kernel of the Gamma density Ga(λ|(n − 1)/2, ns²/2). In terms of the standard deviation σ this becomes π(σ|D) = π(λ|D) |∂λ/∂σ| = 2σ⁻³ Ga(σ⁻²|(n − 1)/2, ns²/2), a square-root inverted gamma density.

A frequent example of this scenario is provided by laboratory measurements made in conditions where central limit conditions apply, so that (assuming no experimental bias) those measurements may be treated as a random sample from a normal distribution centered at the quantity µ which is being measured, and with some (unknown) standard deviation σ. Suppose, for example, that in an elementary physics classroom experiment to measure the gravitational field g with a pendulum, a student has obtained n = 20 measurements of g yielding (in m/sec²) a mean x̄ = 9.8087 and a standard deviation s = 0.0428. Using no other information, the corresponding posterior distribution is π(g|D) = St(g|9.8087, 0.0098, 19), represented in Figure 3(a). In particular, Pr(9.788 < g < 9.829|D) = 0.95, so that, with the information provided by this experiment, the gravitational field at the location of the laboratory may be expected to lie between 9.788 and 9.829 with
probability 0.95. Formally, the posterior distribution of g should be restricted to g > 0; however, as is immediately obvious from Figure 3(a), this would not have any appreciable effect, due to the fact that the likelihood function is actually concentrated on positive g values.

Suppose now that the student is further instructed to incorporate into the analysis the fact that the value of the gravitational field g at the laboratory is known to lie between 9.7803 m/sec² (average value at the Equator) and 9.8322 m/sec² (average value at the poles). The updated posterior distribution will then be

(16) π(g|D, g ∈ Gc) = St(g|m, s/√(n−1), n−1) / ∫_{Gc} St(g|m, s/√(n−1), n−1) dg,  g ∈ Gc,
represented in Figure 3(b), where Gc = {g; 9.7803 < g < 9.8322}. One-dimensional numerical integration may be used to verify that Pr(g > 9.792|D, g ∈ Gc) = 0.95. Moreover, if inferences about the standard deviation σ of the measurement procedure are also requested, the corresponding posterior distribution is found to be π(σ|D) = 2σ⁻³ Ga(σ⁻²|9.5, 0.0183). This has a mean E[σ|D] = 0.0458 and yields Pr(0.0334 < σ < 0.0642|D) = 0.95.
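The numbers in Example 3 can be reproduced with standard Student t routines; only one-dimensional computations (here, differences of distribution functions) are involved. A sketch (Python, assuming scipy; the data summaries are those of the example):

```python
from scipy.stats import t

m, s, n = 9.8087, 0.0428, 20
post = t(df=n - 1, loc=m, scale=s / (n - 1) ** 0.5)   # St(g | 9.8087, 0.0098, 19)

print(post.cdf(9.829) - post.cdf(9.788))              # ~0.95, as reported

# Restriction to Gc = (9.7803, 9.8322): renormalize as in equation (16).
lo, hi = 9.7803, 9.8322
mass = post.cdf(hi) - post.cdf(lo)
print((post.cdf(hi) - post.cdf(9.792)) / mass)        # Pr(g > 9.792 | D, g in Gc) ~ 0.95
```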
3.2 Predictive Distributions
Let D = {x1, . . . , xn}, xi ∈ X, be a set of exchangeable observations, and consider now a situation where it is desired to predict the value of a future observation x ∈ X generated by the same random mechanism that has generated the data D. It follows from the foundations arguments discussed in Section 2 that the solution to this prediction problem is simply encapsulated by the predictive distribution p(x|D) describing the uncertainty on the value that x will take, given the information provided by D and any other available knowledge. Suppose that contextual information suggests the assumption that data D may be considered to be a random sample from a distribution in the family {p(x|ω), ω ∈ Ω}, and let π(ω) be a prior distribution describing available information on the value of ω. Since p(x|ω, D) = p(x|ω), it then follows from standard probability theory that

(17) p(x|D) = ∫_Ω p(x|ω) π(ω|D) dω,
which is an average of the probability distributions of x conditional on the (unknown) value of ω, weighted with the posterior distribution of ω given D. If the assumptions on the probability model are correct, the posterior predictive distribution p(x|D) will converge, as the sample size increases, to the distribution p(x|ω) which has generated the data. Indeed, the best technique to assess the quality of the inferences about ω encapsulated in π(ω|D) is to check against the observed data the predictive distribution p(x|D) generated by π(ω|D). For a good introduction to Bayesian predictive inference, see Geisser [1993].
EXAMPLE 4 Prediction in a Poisson process. Let D = {r1, . . . , rn} be a random sample from a Poisson distribution Pn(r|λ) with parameter λ, so that p(D|λ) ∝ λ^t e^{−λn}, where t = Σ_i ri. It may be shown (see Section 4) that absence of initial information on the value of λ may be formally described by the (improper) prior function π(λ) = λ^{−1/2}. Using Bayes' theorem, the corresponding posterior is

(18) π(λ|D) ∝ λ^t e^{−λn} λ^{−1/2} ∝ λ^{t−1/2} e^{−λn},

the kernel of a Gamma density Ga(λ|t + 1/2, n), with mean (t + 1/2)/n. The corresponding predictive distribution is the Poisson-Gamma mixture

(19) p(r|D) = ∫₀^∞ Pn(r|λ) Ga(λ|t + 1/2, n) dλ = [Γ(r + t + 1/2) / (Γ(t + 1/2) r!)] · n^{t+1/2} / (1 + n)^{r+t+1/2}.

Suppose, for example, that in a firm producing automobile restraint systems, the entire production in each of 10 consecutive months has yielded no complaint from their clients. With no additional information on the average number λ of complaints per month, the quality assurance department of the firm may report that the probabilities that r complaints will be received in the next month of production are given by equation (19), with t = 0 and n = 10. In particular, p(r = 0|D) = 0.953, p(r = 1|D) = 0.043, and p(r = 2|D) = 0.003. Many other situations may be described with the same model. For instance, if meteorological conditions remain similar in a given area, p(r = 0|D) = 0.953 would describe the chances of no flash flood next year, given 10 years without flash floods in the area.

EXAMPLE 5 Prediction in a Normal process. Consider now prediction of a continuous variable. Let D = {x1, . . . , xn} be a random sample from a normal distribution N(x|µ, σ). As mentioned in Example 3, absence of initial information on the values of both µ and σ is formally described by the improper prior function π(µ, σ) = σ⁻¹, and this leads to the joint posterior density (13). The corresponding (posterior) predictive distribution is

(20) p(x|D) = ∫₀^∞ ∫_{−∞}^∞ N(x|µ, σ) π(µ, σ|D) dµ dσ = St(x|x̄, s√{(n + 1)/(n − 1)}, n − 1).

If µ is known to be positive, the appropriate prior function will be the restricted function

(21) π(µ, σ) = σ⁻¹ if µ > 0, and π(µ, σ) = 0 otherwise.

However, the result in equation (20) will still hold, provided the likelihood function p(D|µ, σ) is concentrated on positive µ values. Suppose, for example, that in the firm producing automobile restraint systems, the observed breaking strengths of n = 10 randomly chosen safety belt webbings have mean x̄ = 28.011 kN and standard deviation s = 0.443 kN, and that the relevant engineering specification requires breaking strengths to be larger than 26 kN. If data may truly be assumed to be a random sample from a normal distribution, the likelihood function is only
appreciable for positive µ values, and only the information provided by this small sample is to be used, then the quality engineer may claim that the probability that a safety belt randomly chosen from the same batch as the sample tested would satisfy the required specification is Pr(x > 26|D) = 0.9987. Besides, if production conditions remain constant, 99.87% of the safety belt webbings may be expected to have acceptable breaking strengths.
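Both predictive computations can be checked directly. The sketch below (Python; equation (19) is evaluated on a log scale for numerical stability, and the Example 5 check assumes scipy for the Student distribution function):

```python
from math import exp, lgamma, log
from scipy.stats import t

def poisson_gamma_pred(r, t_stat, n):
    """Equation (19): the Poisson-Gamma predictive probability p(r | D)."""
    a = t_stat + 0.5
    return exp(lgamma(r + a) - lgamma(a) - lgamma(r + 1)
               + a * log(n) - (r + a) * log(1 + n))

for r in range(3):                      # Example 4: t = 0, n = 10
    print(r, round(poisson_gamma_pred(r, 0, 10), 3))   # 0.953, 0.043, 0.003

# Example 5: predictive St(x | 28.011, 0.443*sqrt(11/9), 9); Pr(x > 26 | D).
pred = t(df=9, loc=28.011, scale=0.443 * (11 / 9) ** 0.5)
print(pred.sf(26.0))                    # ~0.9987
```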
3.3 Asymptotic Behaviour
The behaviour of posterior distributions when the sample size is large is now considered. This is important for, at least, two different reasons: (i) asymptotic results provide useful first-order approximations when actual samples are relatively large, and (ii) objective Bayesian methods typically depend on the asymptotic properties of the assumed model. Let D = {x1, . . . , xn}, x ∈ X, be a random sample of size n from {p(x|ω), ω ∈ Ω}. It may be shown that, as n → ∞, the posterior distribution of a discrete parameter ω typically converges to a degenerate distribution which gives probability one to the true value of ω, and that the posterior distribution of a continuous parameter ω typically converges to a normal distribution centered at its maximum likelihood estimate ω̂ (MLE), with a variance matrix which decreases with n as 1/n.

Consider first the situation where Ω = {ω1, ω2, . . .} consists of a countable (possibly infinite) set of values, such that the probability model which corresponds to the true parameter value ωt is distinguishable from the others in the sense that the logarithmic divergence κ{p(x|ωi)|p(x|ωt)} of each of the p(x|ωi) from p(x|ωt) is strictly positive. Taking logarithms in Bayes' theorem, defining zj = log[p(xj|ωi)/p(xj|ωt)], j = 1, . . . , n, and using the strong law of large numbers on the n conditionally independent and identically distributed random quantities z1, . . . , zn, it may be shown that

(22) lim_{n→∞} Pr(ωt|x1, . . . , xn) = 1,  lim_{n→∞} Pr(ωi|x1, . . . , xn) = 0,  i ≠ t.
Thus, under appropriate regularity conditions, the posterior probability of the true parameter value converges to one as the sample size grows.

Consider now the situation where ω is a k-dimensional continuous parameter. Expressing Bayes' theorem as π(ω|x1, . . . , xn) ∝ exp{log[π(ω)] + Σ_{j=1}^n log[p(xj|ω)]}, expanding Σ_j log[p(xj|ω)] about its maximum (the MLE ω̂), and assuming regularity conditions (to ensure that terms of order higher than quadratic may be ignored and that the sum of the terms from the likelihood will dominate the term from the prior) it is found that the posterior density of ω is the approximate k-variate normal

(23) π(ω|x1, . . . , xn) ≈ Nk{ω̂, S(D, ω̂)},  where S⁻¹(D, ω) has general element −Σ_{l=1}^n ∂² log[p(xl|ω)] / ∂ωi ∂ωj.
A simpler, but somewhat poorer, approximation may be obtained by using the
strong law of large numbers on the sums in (23) to establish that S⁻¹(D, ω̂) ≈ n F(ω̂), where F(ω) is Fisher's information matrix, with general element

(24) Fij(ω) = −∫_X p(x|ω) ∂² log[p(x|ω)] / ∂ωi ∂ωj dx,

so that

(25) π(ω|x1, . . . , xn) ≈ Nk(ω|ω̂, n⁻¹ F⁻¹(ω̂)).

Thus, under appropriate regularity conditions, the posterior probability density of the parameter vector ω approaches, as the sample size grows, a multivariate normal density centered at the MLE ω̂, with a variance matrix which decreases with n as n⁻¹.

EXAMPLE 2, continued. Asymptotic approximation with binomial data. Let D = (x1, . . . , xn) consist of n independent Bernoulli trials with parameter θ, so that p(D|θ, n) = θ^r (1 − θ)^{n−r}. This likelihood function is maximized at θ̂ = r/n, and Fisher's information function is F(θ) = θ⁻¹(1 − θ)⁻¹. Thus, using the results above, the posterior distribution of θ will be the approximate normal

(26) π(θ|r, n) ≈ N(θ|θ̂, s(θ̂)/√n),  s(θ) = {θ(1 − θ)}^{1/2},

with mean θ̂ = r/n and variance θ̂(1 − θ̂)/n. This will provide a reasonable approximation to the exact posterior if (i) the prior π(θ) is relatively “flat” in the region where the likelihood function matters, and (ii) both r and n are moderately large. If, say, n = 1500 and r = 720, this leads to π(θ|D) ≈ N(θ|0.480, 0.013), and to Pr(θ < 0.5|D) ≈ 0.940, which may be compared with the exact value Pr(θ < 0.5|D) = 0.933 obtained from the posterior distribution which corresponds to the prior Be(θ|50, 50). ⊳

It follows from the joint posterior asymptotic behaviour of ω and from the properties of the multivariate normal distribution that, if the parameter vector is decomposed into ω = (θ, λ), and Fisher's information matrix is correspondingly partitioned, so that

(27) F(ω) = F(θ, λ) = [ Fθθ(θ, λ)  Fθλ(θ, λ)
                        Fλθ(θ, λ)  Fλλ(θ, λ) ]

and

(28) S(θ, λ) = F⁻¹(θ, λ) = [ Sθθ(θ, λ)  Sθλ(θ, λ)
                             Sλθ(θ, λ)  Sλλ(θ, λ) ],
then the marginal posterior distribution of θ will be

(29) π(θ|D) ≈ N{θ|θ̂, n⁻¹ Sθθ(θ̂, λ̂)},

while the conditional posterior distribution of λ given θ will be

(30) π(λ|θ, D) ≈ N{λ|λ̂ − F⁻¹λλ(θ̂, λ̂) Fλθ(θ̂, λ̂)(θ̂ − θ), n⁻¹ F⁻¹λλ(θ̂, λ̂)}.
Notice that F⁻¹λλ = Sλλ if (and only if) F is block diagonal, i.e. if (and only if) θ and λ are asymptotically independent.

EXAMPLE 3, continued. Asymptotic approximation with normal data. Let D = (x1, . . . , xn) be a random sample from a normal distribution N(x|µ, σ). The corresponding likelihood function p(D|µ, σ) is maximized at (µ̂, σ̂) = (x̄, s), and Fisher's information matrix is diagonal, with Fµµ = σ⁻². Hence, the posterior distribution of µ is approximately N(µ|x̄, s/√n); this may be compared with the exact result π(µ|D) = St(µ|x̄, s/√(n−1), n−1) obtained previously under the assumption of no prior knowledge. ⊳
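The quality of these normal approximations is easy to inspect numerically. For the binomial case of Example 2 (continued), the following sketch (Python with scipy) compares the asymptotic and exact tail probabilities quoted above:

```python
from scipy.stats import beta, norm

r, n = 720, 1500
theta_hat = r / n                                  # MLE: 0.48
sd = (theta_hat * (1 - theta_hat) / n) ** 0.5      # ~0.013

print(norm.cdf(0.5, loc=theta_hat, scale=sd))      # asymptotic: ~0.940
print(beta.cdf(0.5, 50 + r, 50 + n - r))           # exact, Be(50, 50) prior: ~0.933
```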
4 REFERENCE ANALYSIS
Under the Bayesian paradigm, the outcome of any inference problem (the posterior distribution of the quantity of interest) combines the information provided by the data with relevant available prior information. In many situations, however, either the available prior information on the quantity of interest is too vague to warrant the effort required to have it formalized in the form of a probability distribution, or it is too subjective to be useful in scientific communication or public decision making. It is therefore important to be able to identify the mathematical form of a “noninformative” prior, a prior that would have a minimal effect, relative to the data, on the posterior inference. More formally, suppose that the probability mechanism which has generated the available data D is assumed to be p(D|ω), for some ω ∈ Ω, and that the quantity of interest is some real-valued function θ = θ(ω) of the model parameter ω. Without loss of generality, it may be assumed that the probability model is of the form

(31) M = {p(D|θ, λ), D ∈ D, θ ∈ Θ, λ ∈ Λ},

where λ is some appropriately chosen nuisance parameter vector. As described in Section 3, to obtain the required posterior distribution of the quantity of interest π(θ|D) it is necessary to specify a joint prior π(θ, λ). It is now required to identify the form of that joint prior πθ(θ, λ|M, P), the θ-reference prior, which would have a minimal effect on the corresponding posterior distribution of θ,

(32) π(θ|D) ∝ ∫_Λ p(D|θ, λ) πθ(θ, λ|M, P) dλ,
within the class P of all the prior distributions compatible with whatever information about (θ, λ) one is prepared to assume, which may just be the class P0 of all strictly positive priors. To simplify the notation, when there is no danger of confusion the reference prior πθ(θ, λ|M, P) is often simply denoted by π(θ, λ), but its dependence on the quantity of interest θ, the assumed model M and the class P of priors compatible with assumed knowledge should always be kept in mind. To use a conventional expression, the reference prior “would let the data speak for themselves” about the likely value of θ. Properly defined, reference posterior
distributions have an important role to play in scientific communication, for they provide the answer to a central question in the sciences: conditional on the assumed model p(D|θ, λ), and on any further assumptions about the value of θ on which there might be universal agreement, the reference posterior π(θ|D) should specify what could be said about θ if the only available information about θ were some well-documented data D and whatever information (if any) one is prepared to assume by restricting the prior to belong to an appropriate class P.

Much work has been done to formulate “reference” priors which would make the idea described above mathematically precise. For historical details, see [Bernardo and Smith, 1994, Sec. 5.6.2; Kass and Wasserman, 1996; Bernardo, 2005a] and references therein. This section concentrates on an approach that is based on information theory to derive reference distributions which may be argued to provide the most advanced general procedure available; this was initiated by Bernardo [1979b; 1981] and further developed by Berger and Bernardo [1989; 1992a; 1992b; 1992c]; see also [Bernardo, 1997; 2005a; Bernardo and Ramón, 1998; Berger et al., 2009] and references therein. In the formulation described below, far from ignoring prior knowledge, the reference posterior exploits certain well-defined features of a possible prior, namely those describing a situation where relevant knowledge about the quantity of interest (beyond that universally accepted, as specified by the choice of P) may be held to be negligible compared to the information about that quantity which repeated experimentation (from a specific data generating mechanism M) might possibly provide. Reference analysis is appropriate in contexts where the set of inferences which could be drawn in this possible situation is considered to be pertinent.

Any statistical analysis contains a fair number of subjective elements; these include (among others) the data selected, the model assumptions, and the choice of the quantities of interest. Reference analysis may be argued to provide an “objective” Bayesian solution to statistical inference problems in just the same sense that conventional statistical methods claim to be “objective”: in that the solutions only depend on model assumptions and observed data.
4.1 Reference Distributions
One parameter. Consider the experiment which consists of the observation of data D, generated by a random mechanism p(D|θ) which only depends on a real-valued parameter θ ∈ Θ, and let t = t(D) ∈ T be any sufficient statistic (which may well be the complete data set D). In Shannon’s general information theory, the amount of information I^θ{T, π(θ)} which may be expected to be provided by D, or (equivalently) by t(D), about the value of θ is defined by

(33)  I^θ{T, π(θ)} = κ{p(t) π(θ) | p(t|θ) π(θ)} = E_t [ ∫_Θ π(θ|t) log[π(θ|t)/π(θ)] dθ ],

the expected logarithmic divergence of the prior from the posterior. This is naturally a functional of the prior π(θ): the larger the prior information, the smaller the information which the data may be expected to provide. The functional
I^θ{T, π(θ)} is concave, non-negative, and invariant under one-to-one transformations of θ. Consider now the amount of information I^θ{T^k, π(θ)} about θ which may be expected from the experiment which consists of k conditionally independent replications {t1, . . . , tk} of the original experiment. As k → ∞, such an experiment would provide any missing information about θ which could possibly be obtained within this framework; thus, as k → ∞, the functional I^θ{T^k, π(θ)} will approach the missing information about θ associated with the prior π(θ). Intuitively, a θ-“noninformative” prior is one which maximizes the missing information about θ. Formally, if πk(θ) denotes the prior density which maximizes I^θ{T^k, π(θ)} in the class P of prior distributions which are compatible with accepted assumptions on the value of θ (which may well be the class P0 of all strictly positive proper priors), then the θ-reference prior π(θ|M, P) is the limit as k → ∞ (in a sense to be made precise) of the sequence of priors {πk(θ), k = 1, 2, . . .}.
Notice that this limiting procedure is not some kind of asymptotic approximation, but an essential element of the definition of a reference prior. In particular, this definition implies that reference distributions only depend on the asymptotic behaviour of the assumed probability model, a feature which actually simplifies their derivation.
EXAMPLE 6 Maximum entropy. If θ may only take a finite number of values, so that the parameter space is Θ = {θ1, . . . , θm} and π(θ) = {p1, . . . , pm}, with pi = Pr(θ = θi), and there is no topology associated to the parameter space Θ, so that the θi’s are just labels with no quantitative meaning, then the missing information associated to {p1, . . . , pm} reduces to

(34)  lim_{k→∞} I^θ{T^k, π(θ)} = H(p1, . . . , pm) = − Σ_{i=1}^{m} pi log(pi),

that is, the entropy of the prior distribution {p1, . . . , pm}. Thus, in the non-quantitative finite case, the reference prior π(θ|M, P) is that with maximum entropy in the class P of priors compatible with accepted assumptions. Consequently, the reference prior algorithm contains “maximum entropy” priors as the particular case which obtains when the parameter space is a finite set of labels, the only case where the original concept of entropy as a measure of uncertainty is unambiguous and well-behaved. In particular, if P is the class P0 of all priors over {θ1, . . . , θm}, then the reference prior is the uniform prior over the set of possible θ values, π(θ|M, P0) = {1/m, . . . , 1/m}.
Formally, the reference prior function π(θ|M, P) of a univariate parameter θ is defined to be the limit of the sequence of the proper priors πk(θ) which maximize I^θ{T^k, π(θ)}, in the precise sense that, for any value of the sufficient statistic t = t(D), the reference posterior, the intrinsic¹ limit π(θ|t) of the corresponding sequence of posteriors {πk(θ|t)}, may be obtained from π(θ|M, P) by formal use of Bayes theorem, so that π(θ|t) ∝ p(t|θ) π(θ|M, P).

¹ A sequence {πk(θ|t)} of posterior distributions converges intrinsically to a limit π(θ|t) if the sequence of expected intrinsic discrepancies E_t[δ{πk(θ|t), π(θ|t)}] converges to 0, where δ{p, q} = min{k(p|q), k(q|p)}, and k(p|q) = ∫_Θ q(θ) log[q(θ)/p(θ)] dθ. For details, see [Berger et al., 2009].
Reference prior functions are often simply called reference priors, even though they are usually not probability distributions. They should not be considered as expressions of belief, but as technical devices to obtain (proper) posterior distributions, which are a limiting form of the posteriors which could have been obtained from possible prior beliefs which were relatively uninformative with respect to the quantity of interest when compared with the information which data could provide.
If (i) the sufficient statistic t = t(D) is a consistent estimator θ̃ of a continuous parameter θ, and (ii) the class P contains all strictly positive priors, then the reference prior may be shown to have a simple form in terms of any asymptotic approximation to the posterior distribution of θ. Notice that, by construction, an asymptotic approximation to the posterior does not depend on the prior. Specifically, if the posterior density π(θ|D) has an asymptotic approximation of the form π(θ|θ̃, n), the (unrestricted) reference prior is simply

(35)  π(θ|M, P0) ∝ π(θ|θ̃, n) |_{θ̃ = θ}.
One-parameter reference priors are invariant under reparametrization; thus, if ψ = ψ(θ) is a piecewise one-to-one function of θ, then the ψ-reference prior is simply the appropriate probability transformation of the θ-reference prior.
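To make (35) concrete, the following sketch (Python with sympy, which is an assumption; the chapter itself contains no code) applies it to the Bernoulli model, anticipating Examples 7 and 2 below: the asymptotic normal approximation to the posterior, evaluated at θ̃ = θ, recovers Jeffreys’ prior.

```python
# A sketch of equation (35) for the Bernoulli model (sympy assumed).
# The posterior is asymptotically normal around theta_hat; evaluating that
# approximation at theta_hat = theta yields the reference prior.
import sympy as sp

theta, theta_hat, n = sp.symbols('theta theta_hat n', positive=True)
sd = sp.sqrt(theta_hat * (1 - theta_hat) / n)   # asymptotic posterior standard deviation
normal_approx = sp.exp(-(theta - theta_hat)**2 / (2 * sd**2)) / sd  # N(theta|theta_hat, sd), up to constants
prior = normal_approx.subs(theta_hat, theta)    # equation (35): evaluate at theta_hat = theta
print(sp.simplify(prior))  # sqrt(n)/sqrt(theta*(1 - theta)): Jeffreys' prior, up to a constant
```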
EXAMPLE 7 Jeffreys’ prior. If θ is univariate and continuous, and the posterior distribution of θ given {x1, . . . , xn} is asymptotically normal with standard deviation s(θ̃)/√n, then, using (35), the reference prior function is π(θ) ∝ s(θ)^{-1}. Under regularity conditions (often satisfied in practice, see Section 3.3), the posterior distribution of θ is asymptotically normal with variance n^{-1} F^{-1}(θ̂), where F(θ) is Fisher’s information function and θ̂ is the MLE of θ. Hence, the reference prior function in these conditions is π(θ|M, P0) ∝ F(θ)^{1/2}, which is known as Jeffreys’ prior. It follows that the reference prior algorithm contains Jeffreys’ priors as the particular case which obtains when the probability model only depends on a single continuous univariate parameter, there are regularity conditions to guarantee asymptotic normality, and there is no additional information, so that the class of possible priors is the set P0 of all strictly positive priors over Θ. These are precisely the conditions under which there is general agreement on the use of Jeffreys’ prior as a “noninformative” prior.
EXAMPLE 2, continued. Reference prior for a binomial parameter. Let data D = {x1, . . . , xn} consist of a sequence of n independent Bernoulli trials, so that p(x|θ) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}; this is a regular, one-parameter continuous model, whose Fisher’s information function is F(θ) = θ^{-1}(1 − θ)^{-1}. Thus, the reference prior π(θ) is proportional to θ^{-1/2}(1 − θ)^{-1/2}, so that the reference prior is the (proper) Beta distribution Be(θ|1/2, 1/2). Since the reference algorithm is invariant under reparametrization, the reference prior of φ(θ) = 2 arcsin √θ is π(φ) = π(θ)/|∂φ/∂θ| = 1; thus, the reference prior is uniform on the variance-stabilizing transformation φ(θ) = 2 arcsin √θ, a feature generally true under regularity conditions.
Figure 4. Posterior distribution of the proportion of infected people in the population, given the results of n = 100 tests, none of which were positive.

In terms of θ, the reference posterior is π(θ|D) = π(θ|r, n) = Be(θ|r + 1/2, n − r + 1/2), where r = Σ_j xj is the number of positive trials. Suppose, for example, that n = 100 randomly selected people have been tested for an infection and that all tested negative, so that r = 0. The reference posterior distribution of the proportion θ of people infected is then the Beta distribution Be(θ|0.5, 100.5), represented in Figure 4. It may well be known that the infection was rare, leading to the assumption that θ < θ0, for some upper bound θ0; the (restricted) reference prior would then be of the form π(θ) ∝ θ^{-1/2}(1 − θ)^{-1/2} if θ < θ0, and zero otherwise. However, provided the likelihood is concentrated in the region θ < θ0, the corresponding posterior would be virtually identical to Be(θ|0.5, 100.5). Thus, just on the basis of the observed experimental results, one may claim that the proportion of infected people is surely smaller than 5% (for the reference posterior probability of the event θ > 0.05 is 0.001), that θ is smaller than 0.01 with probability 0.844 (the area of the shaded region in Figure 4), that it is equally likely to be over or below 0.23% (for the median, represented by a vertical line, is 0.0023), and that the probability that a person randomly chosen from the population is infected is 0.005 (the posterior mean, represented in the figure by a black circle), since Pr(x = 1|r, n) = E[θ|r, n] = 0.005. If a particular point estimate of θ is required (say a number to be quoted in the summary headline), the intrinsic estimator suggests itself (see Section 5); this is found to be θ* = 0.0032 (represented in the figure by a white circle). Notice that the traditional solution to this problem, based on the asymptotic behaviour of the MLE, here θ̂ = r/n = 0 for any n, makes absolutely no sense in this scenario. ⊳
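These numerical summaries are easy to reproduce. The following sketch (Python with scipy, an assumption) evaluates the reference posterior Be(θ|0.5, 100.5) quoted above:

```python
# Reproducing the reference posterior summaries for n = 100, r = 0 (scipy assumed).
from scipy.stats import beta

posterior = beta(0.5, 100.5)   # Be(theta | r + 1/2, n - r + 1/2) with r = 0, n = 100
print(posterior.sf(0.05))      # Pr(theta > 0.05 | D) ~ 0.001
print(posterior.cdf(0.01))     # Pr(theta < 0.01 | D) ~ 0.844
print(posterior.median())      # posterior median ~ 0.0023
print(posterior.mean())        # posterior mean E[theta | D] ~ 0.005
```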
One nuisance parameter. The extension of the reference prior algorithm to the case of two parameters follows the usual mathematical procedure of reducing the problem to a sequential application of the established procedure for the single parameter case. Thus, if the probability model is p(t|θ, λ), θ ∈ Θ, λ ∈ Λ, and a θ-reference prior πθ(θ, λ|M, P) is required, the reference algorithm proceeds in two steps:

(i) Conditional on θ, p(t|θ, λ) only depends on the nuisance parameter λ and, hence, the one-parameter algorithm may be used to obtain the conditional reference prior π(λ|θ, M, P).
(ii) If π(λ|θ, M, P) is proper, this may be used to integrate out the nuisance parameter, thus obtaining the one-parameter integrated model p(t|θ) = ∫_Λ p(t|θ, λ) π(λ|θ, M, P) dλ, to which the one-parameter algorithm may be applied again to obtain π(θ|M, P). The θ-reference prior is then πθ(θ, λ|M, P) = π(λ|θ, M, P) π(θ|M, P), and the required reference posterior is π(θ|t) ∝ p(t|θ) π(θ|M, P).

If the conditional reference prior is not proper, then the procedure is performed within an increasing sequence {Λi} of subsets converging to Λ over which π(λ|θ) is integrable. This makes it possible to obtain a corresponding sequence of θ-reference posteriors {πi(θ|t)} for the quantity of interest θ, and the required reference posterior is the corresponding intrinsic limit π(θ|t) = lim_i πi(θ|t). A θ-reference prior is then defined as a positive function πθ(θ, λ) which may be formally used in Bayes’ theorem as a prior to obtain the reference posterior, i.e. such that, for any sufficient statistic t ∈ T (which may well be the whole data set D), π(θ|t) ∝ ∫_Λ p(t|θ, λ) πθ(θ, λ) dλ. The approximating sequences should be consistently chosen within a given model: given a probability model {p(x|ω), ω ∈ Ω}, an appropriate approximating sequence {Ωi} should be chosen for the whole parameter space Ω. Thus, if the analysis is done in terms of, say, ψ = {ψ1, ψ2} ∈ Ψ(Ω), the approximating sequence should be chosen such that Ψi = ψ(Ωi). A natural approximating sequence in location-scale problems is {µ, log σ} ∈ [−i, i]².
The θ-reference prior does not depend on the choice of the nuisance parameter λ; thus, for any ψ = ψ(θ, λ) such that (θ, ψ) is a one-to-one function of (θ, λ), the θ-reference prior in terms of (θ, ψ) is simply πθ(θ, ψ) = πθ(θ, λ)/|∂(θ, ψ)/∂(θ, λ)|, the appropriate probability transformation of the θ-reference prior in terms of (θ, λ). Notice, however, that the reference prior may depend on the parameter of interest; thus, the θ-reference prior may differ from the φ-reference prior unless either φ is a piecewise one-to-one transformation of θ, or φ is asymptotically independent of θ. This is an expected consequence of the fact that the conditions under which the missing information about θ is maximized are not generally the same as the conditions which maximize the missing information about an arbitrary function φ = φ(θ, λ).
The non-existence of a unique “noninformative prior” which would be appropriate for any inference problem within a given model was established by Dawid, Stone and Zidek [1973], when they showed that this is incompatible with consistent marginalization. Indeed, if, given the model p(D|θ, λ), the reference posterior
of the quantity of interest θ, π(θ|D) = π(θ|t), only depends on the data through a statistic t whose sampling distribution, p(t|θ, λ) = p(t|θ), only depends on θ, one would expect the reference posterior to be of the form π(θ|t) ∝ π(θ) p(t|θ) for some prior π(θ). However, examples were found where this cannot be the case if a unique joint “noninformative” prior were to be used for all possible quantities of interest.
EXAMPLE 8 Regular two-dimensional continuous reference prior functions. If the joint posterior distribution of (θ, λ) is asymptotically normal, then the θ-reference prior may be derived in terms of the corresponding Fisher’s information matrix, F(θ, λ). Indeed, if

(36)  F(θ, λ) = [ Fθθ(θ, λ), Fθλ(θ, λ); Fθλ(θ, λ), Fλλ(θ, λ) ],  and  S(θ, λ) = F^{-1}(θ, λ),

then the unrestricted θ-reference prior is πθ(θ, λ|M, P0) = π(λ|θ) π(θ), where

(37)  π(λ|θ) ∝ Fλλ^{1/2}(θ, λ),  λ ∈ Λ.

If π(λ|θ) is proper,

(38)  π(θ) ∝ exp { ∫_Λ π(λ|θ) log[Sθθ^{-1/2}(θ, λ)] dλ },  θ ∈ Θ.

If π(λ|θ) is not proper, integrations are performed on an approximating sequence {Λi} to obtain a sequence {πi(λ|θ) πi(θ)} (where πi(λ|θ) is the proper renormalization of π(λ|θ) to Λi), and the θ-reference prior πθ(θ, λ) is defined as its appropriate limit. Moreover, if (i) both Fλλ^{1/2}(θ, λ) and Sθθ^{-1/2}(θ, λ) factorize, so that

(39)  Sθθ^{-1/2}(θ, λ) ∝ fθ(θ) gθ(λ),  Fλλ^{1/2}(θ, λ) ∝ fλ(θ) gλ(λ),
and (ii) the parameters θ and λ are variation independent, so that Λ does not depend on θ, then the θ-reference prior is simply πθ(θ, λ) = fθ(θ) gλ(λ), even if the conditional reference prior π(λ|θ) = π(λ) ∝ gλ(λ) (which will not depend on θ) is actually improper.
EXAMPLE 3, continued. Reference priors for the normal model. The information matrix which corresponds to a normal model N(x|µ, σ) is

(40)  F(µ, σ) = [ σ^{-2}, 0; 0, 2σ^{-2} ],  S(µ, σ) = F^{-1}(µ, σ) = [ σ², 0; 0, σ²/2 ];

hence Fσσ^{1/2}(µ, σ) = √2 σ^{-1} = fσ(µ) gσ(σ), with gσ(σ) = σ^{-1}, and thus π(σ|µ) = σ^{-1}. Similarly, Sµµ^{-1/2}(µ, σ) = σ^{-1} = fµ(µ) gµ(σ), with fµ(µ) = 1, and thus π(µ) = 1. Therefore, the µ-reference prior is πµ(µ, σ|M, P0) = π(σ|µ) π(µ) = σ^{-1}, as already anticipated. Moreover, as one would expect from the fact that F(µ, σ) is diagonal, and also as anticipated, it is similarly found that the σ-reference prior is πσ(µ, σ|M, P0) = σ^{-1}, the same as before.
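The algebra in this example is simple enough to check mechanically. A short sketch (Python with sympy, an assumption) of the recipe of Example 8 applied to (40):

```python
# Checking Example 8's recipe on the normal model (sympy assumed).
import sympy as sp

mu, sigma = sp.symbols('mu sigma', positive=True)
F = sp.Matrix([[sigma**-2, 0], [0, 2 * sigma**-2]])  # Fisher matrix, equation (40)
S = F.inv()                                          # S(mu, sigma) = diag(sigma^2, sigma^2/2)

print(sp.sqrt(F[1, 1]))      # F_ss^{1/2} = sqrt(2)/sigma, so pi(sigma|mu) ∝ 1/sigma
print(1 / sp.sqrt(S[0, 0]))  # S_mm^{-1/2} = 1/sigma: free of mu, so pi(mu) ∝ 1
```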
Suppose, however, that the quantity of interest is not the mean µ or the standard deviation σ, but the standardized mean φ = µ/σ. Fisher’s information matrix in terms of the parameters φ and σ is F(φ, σ) = Jᵗ F(µ, σ) J, where J = (∂(µ, σ)/∂(φ, σ)) is the Jacobian of the inverse transformation; this yields

(41)  F(φ, σ) = [ 1, φσ^{-1}; φσ^{-1}, σ^{-2}(2 + φ²) ],  S(φ, σ) = [ 1 + φ²/2, −φσ/2; −φσ/2, σ²/2 ].

Thus, Sφφ^{-1/2}(φ, σ) ∝ (1 + φ²/2)^{-1/2} and Fσσ^{1/2}(φ, σ) ∝ σ^{-1}(2 + φ²)^{1/2}. Hence, using again the results in Example 8, πφ(φ, σ|M, P0) = (1 + φ²/2)^{-1/2} σ^{-1}. In the original parametrization, this is πφ(µ, σ|M, P0) = (1 + (µ/σ)²/2)^{-1/2} σ^{-2}, which is very different from πµ(µ, σ|M, P0) = πσ(µ, σ|M, P0) = σ^{-1}. The corresponding reference posterior of φ is π(φ|x1, . . . , xn) ∝ (1 + φ²/2)^{-1/2} p(t|φ), where t = (Σ xj)/(Σ xj²)^{1/2}, a one-dimensional (marginally sufficient) statistic whose sampling distribution, p(t|µ, σ) = p(t|φ), only depends on φ. Thus, the reference prior algorithm is seen to be consistent under marginalization. ⊳
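The transformation step above is again easy to verify symbolically; a sketch (sympy assumed) reproducing (41) and the factors of the φ-reference prior:

```python
# Reference prior for the standardized mean phi = mu/sigma (sympy assumed).
import sympy as sp

phi, sigma = sp.symbols('phi sigma', positive=True)
F_mu_sigma = sp.Matrix([[sigma**-2, 0], [0, 2 * sigma**-2]])  # equation (40)

# mu = phi*sigma, so the Jacobian of (mu, sigma) with respect to (phi, sigma) is:
J = sp.Matrix([[sigma, phi], [0, 1]])
F = sp.simplify(J.T * F_mu_sigma * J)   # equation (41)
S = sp.simplify(F.inv())

print(sp.sqrt(F[1, 1]))      # pi(sigma|phi) ∝ sigma^{-1} (2 + phi^2)^{1/2}
print(1 / sp.sqrt(S[0, 0]))  # S_pp^{-1/2} = (1 + phi^2/2)^{-1/2}
```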
Many parameters. The reference algorithm is easily generalized to an arbitrary number of parameters. If the model is p(t|ω1, . . . , ωm), a joint reference prior

(42)  π(θm|θm−1, . . . , θ1) × · · · × π(θ2|θ1) × π(θ1)

may sequentially be obtained for each ordered parametrization {θ1(ω), . . . , θm(ω)} of interest, and these are invariant under reparametrization of any of the θi(ω)’s. The choice of the ordered parametrization {θ1, . . . , θm} precisely describes the particular prior required, namely that which sequentially maximizes the missing information about each of the θi’s, conditional on {θ1, . . . , θi−1}, for i = m, m − 1, . . . , 1.
EXAMPLE 9 Stein’s paradox. Let D be a random sample from an m-variate normal distribution with mean µ = {µ1, . . . , µm} and unitary variance matrix. The reference prior which corresponds to any permutation of the µi’s is uniform, and this prior leads indeed to appropriate reference posterior distributions for any of the µi’s, namely π(µi|D) = N(µi|x̄i, 1/√n). Suppose, however, that the quantity of interest is θ = Σi µi², the distance of µ to the origin. As shown by Stein [1959], the posterior distribution of θ based on that uniform prior (or on any “flat” proper approximation) has very undesirable properties; this is due to the fact that a uniform (or nearly uniform) prior, although “noninformative” with respect to each of the individual µi’s, is actually highly informative on the sum of their squares, introducing a severe positive bias (Stein’s paradox). However, the reference prior which corresponds to a parametrization of the form {θ, λ1, . . . , λm−1} produces, for any choice of the nuisance parameters λi = λi(µ), the reference posterior π(θ|D) = π(θ|t) ∝ θ^{-1/2} χ²(nt|m, nθ), where t = Σi x̄i², and this posterior is shown to have the appropriate consistency properties.
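The positive bias which Stein identified is easy to exhibit by simulation. A quick sketch (numpy assumed), with m = 500 means all equal to zero, so that the true value of θ is 0:

```python
# Illustrating Stein's paradox: the flat-prior posterior of theta = sum mu_i^2
# concentrates far above the true value when m is large (numpy assumed).
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 1                                # m means, n observations each
xbar = rng.normal(0.0, 1 / np.sqrt(n), m)    # true mu_i = 0, so true theta = 0

# Under the flat prior, mu_i | D ~ N(xbar_i, 1/sqrt(n)); draw from the posterior:
mu_post = rng.normal(xbar, 1 / np.sqrt(n), size=(4000, m))
theta_post = (mu_post ** 2).sum(axis=1)
print(theta_post.mean())                     # ~ 2 m / n = 1000, although true theta = 0
```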
Far from being specific to Stein’s example, the inappropriate behaviour in problems with many parameters of specific marginal posterior distributions derived
from multivariate “flat” priors (proper or improper) is indeed very frequent. Hence, sloppy, uncontrolled use of “flat” priors (rather than the relevant reference priors) is very strongly discouraged.
Limited information. Although often used in contexts where no universally agreed prior knowledge about the quantity of interest is available, the reference algorithm may be used to specify a prior which incorporates any acceptable prior knowledge; it suffices to maximize the missing information within the class P of priors which is compatible with such accepted knowledge. Indeed, by progressive incorporation of further restrictions into P, the reference prior algorithm becomes a method of (prior) probability assessment. As described below, the problem has a fairly simple analytical solution when those restrictions take the form of known expected values. The incorporation of other types of restrictions usually involves numerical computations.
EXAMPLE 10 Univariate restricted reference priors. If the probability mechanism which is assumed to have generated the available data only depends on a univariate continuous parameter θ ∈ Θ ⊂ ℜ, and the class P of acceptable priors is a class of proper priors which satisfies some expected value restrictions, so that

(43)  P = { π(θ); π(θ) > 0, ∫_Θ π(θ) dθ = 1, ∫_Θ gi(θ) π(θ) dθ = βi, i = 1, . . . , m },

then the (restricted) reference prior is

(44)  π(θ|M, P) ∝ π(θ|M, P0) exp[ Σ_{i=1}^{m} γi gi(θ) ],

where π(θ|M, P0) is the unrestricted reference prior and the γi’s are constants (the corresponding Lagrange multipliers), to be determined by the restrictions which define P. Suppose, for instance, that data are considered to be a random sample from a location model centered at θ, and that it is further assumed that E[θ] = µ0 and that Var[θ] = σ0². The unrestricted reference prior for any regular location problem may be shown to be uniform, so that here π(θ|M, P0) = 1. Thus, the restricted reference prior must be of the form π(θ|M, P) ∝ exp{γ1 θ + γ2 (θ − µ0)²}, with ∫_Θ θ π(θ|M, P) dθ = µ0 and ∫_Θ (θ − µ0)² π(θ|M, P) dθ = σ0². Hence, π(θ|M, P) is the normal distribution with the specified mean and variance, N(θ|µ0, σ0).
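The Lagrange multipliers can also be found numerically, which is how one would proceed with restrictions that do not admit an analytical solution. A sketch (numpy/scipy assumed) for the two moment restrictions of Example 10:

```python
# Solving numerically for the multipliers in (44) with E[theta] = mu0 and
# Var[theta] = sigma0^2; the solution should give gamma1 = 0 and
# gamma2 = -1/(2 sigma0^2), i.e. the normal prior N(theta | mu0, sigma0).
import numpy as np
from scipy.optimize import fsolve

mu0, sigma0 = 0.0, 1.0
theta = np.linspace(-8, 8, 2001)   # a grid standing in for Theta
dx = theta[1] - theta[0]

def moment_gap(gamma):
    g1, g2 = gamma
    w = np.exp(g1 * theta + g2 * (theta - mu0) ** 2)   # unnormalized prior (44)
    w /= w.sum() * dx                                  # normalize on the grid
    mean = (theta * w).sum() * dx
    var = ((theta - mu0) ** 2 * w).sum() * dx
    return [mean - mu0, var - sigma0 ** 2]

g1, g2 = fsolve(moment_gap, x0=[0.1, -0.3])
print(g1, g2)   # ~ 0.0 and -0.5 = -1/(2 sigma0^2)
```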
4.2 Frequentist Properties
Bayesian methods provide a direct solution to the problems typically posed in statistical inference; indeed, posterior distributions precisely state what can be said about unknown quantities of interest given available data and prior knowledge. In particular, unrestricted reference posterior distributions state what could be said if no prior knowledge about the quantities of interest were available.
A frequentist analysis of the behaviour of Bayesian procedures under repeated sampling may, however, be illuminating, for this provides some interesting connections between frequentist and Bayesian inference. It is found that the frequentist properties of Bayesian reference procedures are typically excellent, and may be used to provide a form of calibration for reference posterior probabilities.
Point Estimation. It is generally accepted that, as the sample size increases, a “good” estimator θ̃ of θ ought to get the correct value of θ eventually, that is, to be consistent. Under appropriate regularity conditions, any Bayes estimator φ* of any function φ(θ) converges in probability to φ(θ), so that sequences of Bayes estimators are typically consistent. Indeed, it is known that if there is a consistent sequence of estimators, then Bayes estimators are consistent. The rate of convergence is often best for reference Bayes estimators.
It is also generally accepted that a “good” estimator should be admissible, that is, not dominated by any other estimator, in the sense that its expected loss under sampling (conditional on θ) cannot be larger for all θ values than that corresponding to another estimator. Any proper Bayes estimator is admissible; moreover, as established by Wald [1950], a procedure must be Bayesian (proper or improper) to be admissible. Most published admissibility results refer to quadratic loss functions, but they often extend to more general loss functions. Reference Bayes estimators are typically admissible with respect to appropriate loss functions.
Notice, however, that many other apparently intuitive frequentist ideas on estimation have been proved to be potentially misleading. For example, given a sequence of n Bernoulli observations with parameter θ resulting in r positive trials, the best unbiased estimate of θ² is found to be r(r − 1)/{n(n − 1)}, which yields θ̃² = 0 when r = 1; but to estimate the probability of two positive trials as zero, when one positive trial has been observed, is less than sensible. In marked contrast, any Bayes reference estimator provides a reasonable answer. For example, the intrinsic estimator of θ² is simply (θ*)², where θ* is the intrinsic estimator of θ described in Section 5.1. In particular, if r = 1 and n = 2, the intrinsic estimator of θ² is (as one would naturally expect) (θ*)² = 1/4.
Interval Estimation. As the sample size increases, the frequentist coverage probability of a posterior q-credible region typically converges to q so that, for large samples, Bayesian credible intervals may (under regularity conditions) be interpreted as approximate frequentist confidence regions: under repeated sampling, a Bayesian q-credible region of θ based on a large sample will cover the true value of θ approximately 100q% of the time. Detailed results are readily available for univariate problems. For instance, consider the probability model {p(D|ω), ω ∈ Ω}, let θ = θ(ω) be any univariate quantity of interest, and let t = t(D) ∈ T be any sufficient statistic. If θq(t) denotes the 100q% quantile of the posterior distribution of θ which corresponds to some unspecified prior, so that

(45)  Pr[θ ≤ θq(t) | t] = ∫_{θ ≤ θq(t)} π(θ|t) dθ = q,
then the coverage probability of the q-credible interval {θ; θ ≤ θq(t)},

(46)  Pr[θq(t) ≥ θ | ω] = ∫_{θq(t) ≥ θ} p(t|ω) dt,

is such that

(47)  Pr[θq(t) ≥ θ | ω] = Pr[θ ≤ θq(t) | t] + O(n^{-1/2}).

This asymptotic approximation is true for all (sufficiently regular) positive priors. However, the approximation is better, actually O(n^{-1}), for a particular class of priors known as (first-order) probability matching priors. For details on probability matching priors see Datta and Sweeting [2005] and references therein. Reference priors are typically found to be probability matching priors, so that they provide this improved asymptotic agreement. As a matter of fact, the agreement (in regular problems) is typically quite good even for relatively small samples.
EXAMPLE 11 Product of normal means. Consider the case where independent random samples {x1, . . . , xn} and {y1, . . . , ym} have respectively been taken from the normal densities N(x|ω1, 1) and N(y|ω2, 1), and suppose that the quantity of interest is the product of their means, φ = ω1 ω2 (for instance, one may be interested in inferences about the area φ of a rectangular piece of land, given measurements {xi} and {yj} of its sides). Notice that this is a simplified version of a problem that is often encountered in the sciences, where one is interested in the product of several magnitudes, all of which have been measured with error. Using the procedure described in Example 8, with the natural approximating sequence induced by (ω1, ω2) ∈ [−i, i]², the φ-reference prior is found to be

(48)  πφ(ω1, ω2|M, P0) ∝ (n ω1² + m ω2²)^{-1/2},

very different from the uniform prior πω1(ω1, ω2|M, P0) = πω2(ω1, ω2|M, P0) = 1 which should be used to make objective inferences about either ω1 or ω2. The prior πφ(ω1, ω2) may be shown to provide approximate agreement between Bayesian credible regions and frequentist confidence intervals for φ; indeed, this prior (with m = n) was originally suggested by Stein in the 1980s to obtain such approximate agreement. The same example was later used by Efron [1986] to stress the fact that, even within a fixed probability model {p(D|ω), ω ∈ Ω}, the prior required to make objective inferences about some function of the parameters φ = φ(ω) must generally depend on the function φ. For further details on the reference analysis of this problem, see [Berger and Bernardo, 1989].
The numerical agreement between reference Bayesian credible regions and frequentist confidence intervals is actually perfect in special circumstances. Indeed, as Lindley [1958] pointed out, this is the case in those problems of inference which may be transformed to location-scale problems.
EXAMPLE 3, continued. Inference on normal parameters. Let D = {x1, . . . , xn} be a random sample from a normal distribution N(x|µ, σ). As mentioned before,
the reference posterior of the quantity of interest µ is the Student distribution St(µ|x̄, s/√(n − 1), n − 1). Thus, normalizing µ, the posterior distribution of t(µ) = √(n − 1)(x̄ − µ)/s, as a function of µ given D, is the standard Student St(t|0, 1, n − 1) with n − 1 degrees of freedom. On the other hand, this function t is recognized to be precisely the conventional t statistic, whose sampling distribution is well known to also be standard Student with n − 1 degrees of freedom. It follows that, for all sample sizes, posterior reference credible intervals for µ given the data will be numerically identical to frequentist confidence intervals based on the sampling distribution of t.
A similar result is obtained in inferences about the variance. Thus, the reference posterior distribution of λ = σ^{-2} is the Gamma distribution Ga(λ|(n − 1)/2, ns²/2) and, hence, the posterior distribution of r = ns²/σ², as a function of σ² given D, is a (central) χ² with n − 1 degrees of freedom. But the function r is recognized to be a conventional statistic for this problem, whose sampling distribution is well known to also be χ² with n − 1 degrees of freedom. It follows that, for all sample sizes, posterior reference credible intervals for σ² (or any one-to-one function of σ²) given the data will be numerically identical to frequentist confidence intervals based on the sampling distribution of r. ⊳
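This exact agreement, and the approximate agreement of (47), can be checked by simulation. A small Monte Carlo sketch (numpy/scipy assumed) of the frequentist coverage of reference 95%-credible intervals for µ:

```python
# Coverage check: the reference 0.95-credible interval for mu is
# x̄ ± t_{0.975, n-1} s/sqrt(n-1), with n s^2 = sum (x_j - x̄)^2 as in the text;
# under repeated sampling it should cover the true mu 95% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, sigma, n, reps = 1.0, 2.0, 10, 20000
tq = stats.t.ppf(0.975, df=n - 1)
covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, n)
    xbar = x.mean()
    s = np.sqrt(((x - xbar) ** 2).sum() / n)
    half = tq * s / np.sqrt(n - 1)
    covered += (xbar - half <= mu_true <= xbar + half)
print(covered / reps)   # ~ 0.95
```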
5 INFERENCE SUMMARIES
From a Bayesian viewpoint, the final outcome of a problem of inference about any unknown quantity is nothing but the corresponding posterior distribution. Thus, given some data D and conditions C, all that can be said about any function ω of the parameters which govern the model is contained in the posterior distribution π(ω|D, C), and all that can be said about some function y of future observations from the same model is contained in its posterior predictive distribution p(y|D, C). Indeed, Bayesian inference may technically be described as a decision problem where the space of available actions is the class of those posterior probability distributions of the quantity of interest which are compatible with accepted assumptions. However, to make it easier for the user to assimilate the appropriate conclusions, it is often convenient to summarize the information contained in the posterior distribution by (i) providing values of the quantity of interest which, in the light of the data, are likely to be “close” to its true value and by (ii) measuring the compatibility of the results with hypothetical values of the quantity of interest which might have been suggested in the context of the investigation. In this section, those Bayesian counterparts of traditional estimation and hypothesis testing problems are briefly considered.
5.1 Estimation
In one or two dimensions, a graph of the posterior probability density of the quantity of interest (or the probability mass function in the discrete case) immediately
conveys an intuitive, “impressionist” summary of the main conclusions which may possibly be drawn on its value. Indeed, this is greatly appreciated by users, and may be quoted as an important asset of Bayesian methods. From a plot of its posterior density, the region where (given the data) a univariate quantity of interest is likely to lie is easily distinguished. For instance, all important conclusions about the value of the gravitational field in Example 3 are qualitatively available from Figure 3. However, this does not easily extend to more than two dimensions and, besides, quantitative conclusions (in a simpler form than that provided by the mathematical expression of the posterior distribution) are often required.
Point Estimation. Let D be the available data, which are assumed to have been generated by a probability model {p(D|ω), ω ∈ Ω}, and let θ = θ(ω) ∈ Θ be the quantity of interest. A point estimator of θ is some function of the data θ̃ = θ̃(D) which could be regarded as an appropriate proxy for the actual, unknown value of θ. Formally, to choose a point estimate for θ is a decision problem, where the action space is the class Θ of possible θ values. From a decision-theoretic perspective, to choose a point estimate θ̃ of some quantity θ is a decision to act as though θ̃ were θ, not to assert something about the value of θ (although the desire to assert something simple may well be the reason to obtain an estimate). As prescribed by the foundations of decision theory (Section 2), to solve this decision problem it is necessary to specify a loss function L(θ̃, θ) measuring the consequences of acting as if the true value of the quantity of interest were θ̃, when it is actually θ. The expected posterior loss if θ̃ were used is

(49)  L[θ̃|D] = ∫_Θ L(θ̃, θ) π(θ|D) dθ,
and the corresponding Bayes estimator θ∗ is that function of the data, θ∗ = θ∗ (D), which minimizes this expectation.
EXAMPLE 12 Conventional Bayes estimators. For any given model and data, the Bayes estimator obviously depends on the chosen loss function. The loss function is context specific, and should be chosen in terms of the anticipated uses of the estimate; however, a number of conventional loss functions have been suggested for those situations where no particular uses are envisaged. These loss functions produce estimates which may be regarded as simple descriptions of the location of the posterior distribution. For example, if the loss function is quadratic, so that L(θ̃, θ) = (θ̃ − θ)ᵗ(θ̃ − θ), then the Bayes estimator is the posterior mean θ* = E[θ|D], assuming that the mean exists. Similarly, if the loss function is a zero-one function, so that L(θ̃, θ) = 0 if θ̃ belongs to a ball of radius ǫ centered at θ and L(θ̃, θ) = 1 otherwise, then the Bayes estimator θ* tends to the posterior mode as the ball radius ǫ tends to zero, assuming that a unique mode exists. If θ is univariate and the loss function is linear, so that L(θ̃, θ) = c1(θ̃ − θ) if θ̃ ≥ θ, and L(θ̃, θ) = c2(θ − θ̃) otherwise, then the Bayes estimator is the posterior quantile of order c2/(c1 + c2), so that Pr[θ < θ*] = c2/(c1 + c2). In particular, if c1 = c2, the Bayes estimator is the posterior median.
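These correspondences are easy to illustrate numerically; a sketch (scipy assumed) for a skewed Be(θ|3, 9) posterior:

```python
# Conventional Bayes estimators under a Be(theta|3, 9) posterior (scipy assumed).
from scipy import stats, optimize

post = stats.beta(3, 9)
print(post.mean())               # quadratic loss: posterior mean = 3/12 = 0.25
print(post.ppf(0.5))             # linear loss with c1 = c2: posterior median
c1, c2 = 1.0, 3.0
print(post.ppf(c2 / (c1 + c2)))  # general linear loss: quantile of order c2/(c1+c2)
# zero-one loss (ball radius -> 0): posterior mode, (a-1)/(a+b-2) = 0.2 for Be(3,9)
mode = optimize.minimize_scalar(lambda t: -post.pdf(t), bounds=(0, 1), method='bounded').x
print(mode)
```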
The results derived for linear loss functions clearly illustrate the fact that any possible parameter value may turn out to be the Bayes estimator: it all depends on the loss function describing the consequences of the anticipated uses of the estimate.
EXAMPLE 13 Intrinsic estimation. Conventional loss functions are typically not invariant under reparametrization. It follows that the Bayes estimator φ* of a one-to-one transformation φ = φ(θ) of the original parameter θ is not necessarily φ(θ*) (the univariate posterior median, which is invariant, is an interesting exception). Moreover, conventional loss functions focus on the “distance” between the estimate θ̃ and the true value θ, rather than on the “distance” between the probability models they label. Inference-oriented loss functions directly focus on how different the probability model p(D|θ, λ) is from its closest approximation within the family {p(D|θ̃, λi), λi ∈ Λ}, and typically produce invariant solutions. An attractive example is the intrinsic discrepancy, δ(θ̃, θ), defined as the minimum logarithmic divergence between a probability model labeled by θ and a probability model labeled by θ̃. When there are no nuisance parameters, this is given by

(50)  δ(θ̃, θ) = min{κ(θ̃|θ), κ(θ|θ̃)},  κ(θi|θ) = ∫_T p(t|θ) log[p(t|θ)/p(t|θi)] dt,

where t = t(D) ∈ T is any sufficient statistic (which may well be the whole data set D). The definition is easily extended to problems with nuisance parameters; in this case,

(51)  δ(θ̃, θ, λ) = min_{λi ∈ Λ} δ(θ̃, λi, θ, λ)
measures the logarithmic divergence from p(t|θ, λ) of its closest approximation with θ = θ̃, and the loss function now depends on the complete parameter vector (θ, λ). Although not explicitly shown in the notation, the intrinsic discrepancy function typically depends on the sample size n; indeed, when the data consist of a random sample D = {x1, . . . , xn} from some model p(x|θ), then κ(θi|θ) = n ∫_X p(x|θ) log[p(x|θ)/p(x|θi)] dx, so that the discrepancy associated with the full model is simply n times the discrepancy which corresponds to a single observation. The intrinsic discrepancy is a symmetric, non-negative loss function with a direct interpretation in information-theoretic terms as the minimum amount of information which is expected to be necessary to distinguish between the model p(D|θ, λ) and its closest approximation within the class {p(D|θ̃, λi), λi ∈ Λ}. Moreover, it is invariant under one-to-one reparametrization of the parameter of interest θ, and does not depend on the choice of the nuisance parameter λ.
The intrinsic estimator is naturally obtained by minimizing the reference posterior expected intrinsic discrepancy

(52)  d(θ̃|D) = ∫_Λ ∫_Θ δ(θ̃, θ, λ) π(θ, λ|D) dθ dλ.

Since the intrinsic discrepancy is invariant under reparametrization, minimizing its posterior expectation produces invariant estimators. For further details on intrinsic point estimation see [Bernardo and Juárez, 2003; Bernardo, 2006].
EXAMPLE 2, continued. Intrinsic estimation of a binomial parameter. In the estimation of a binomial proportion θ, given data D = (n, r), the Bayes reference estimator associated with the quadratic loss (the corresponding posterior mean) is E[θ|D] = (r + 1/2)/(n + 1), while the quadratic loss based estimator of, say, the log-odds φ(θ) = log[θ/(1 − θ)], is found to be E[φ|D] = ψ(r + 1/2) − ψ(n − r + 1/2) (where ψ(x) = d log[Γ(x)]/dx is the digamma function), which is not equal to φ(E[θ|D]). The intrinsic loss function in this problem is

(53)  δ(θ̃, θ) = n min{κ(θ̃|θ), κ(θ|θ̃)},  κ(θi|θ) = θ log[θ/θi] + (1 − θ) log[(1 − θ)/(1 − θi)],

and the corresponding intrinsic estimator θ* is obtained by minimizing the expected posterior loss d(θ̃|D) = ∫ δ(θ̃, θ) π(θ|D) dθ. The exact value of θ* may be obtained by numerical minimization, but a very good approximation is given by θ* ≈ (r + 1/3)/(n + 2/3). Since intrinsic estimation is an invariant procedure, the intrinsic estimator of the log-odds will simply be the log-odds of the intrinsic estimator of θ. As one would expect, when r and n − r are both large, all Bayes estimators of any well-behaved function φ(θ) will cluster around φ(E[θ|D]). ⊳
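The numerical minimization is straightforward; a sketch (numpy/scipy assumed) for the infection data of Figure 4 (r = 0, n = 100), which should recover the value θ* = 0.0032 quoted earlier:

```python
# Intrinsic point estimation for a binomial parameter (numpy/scipy assumed):
# minimize the reference posterior expectation of the intrinsic loss (53).
import numpy as np
from scipy import integrate, optimize, stats

def kappa(ti, t):   # directed logarithmic divergence of Be(ti) from Be(t), per observation
    return t * np.log(t / ti) + (1 - t) * np.log((1 - t) / (1 - ti))

def expected_loss(t_hat, r, n):
    post = stats.beta(r + 0.5, n - r + 0.5)   # reference posterior
    f = lambda t: n * min(kappa(t_hat, t), kappa(t, t_hat)) * post.pdf(t)
    return integrate.quad(f, 1e-10, 1 - 1e-10)[0]

r, n = 0, 100
res = optimize.minimize_scalar(expected_loss, bounds=(1e-6, 0.2), args=(r, n), method='bounded')
print(res.x, (r + 1/3) / (n + 2/3))   # ~ 0.0032 and the approximation 0.0033
```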
Interval Estimation. To describe the inferential content of the posterior distribution of the quantity of interest π(θ|D) it is often convenient to quote regions R ⊂ Θ of given probability under π(θ|D). For example, the identification of regions containing 50%, 90%, 95%, or 99% of the probability under the posterior may be sufficient to convey the general quantitative messages implicit in π(θ|D); indeed, this is the intuitive basis of graphical representations of univariate distributions like those provided by boxplots.
Any region R ⊂ Θ such that ∫_R π(θ|D) dθ = q (so that, given data D, the true value of θ belongs to R with probability q) is said to be a posterior q-credible region of θ. Notice that this immediately provides a direct intuitive statement about the unknown quantity of interest θ in probability terms, in marked contrast to the circumlocutory statements provided by frequentist confidence intervals. Clearly, for any given q there are generally infinitely many credible regions. A credible region is invariant under reparametrization; thus, for any q-credible region R of θ, φ(R) is a q-credible region of φ = φ(θ). Sometimes, credible regions are selected to have minimum size (length, area, volume), resulting in highest probability density (HPD) regions, where all points in the region have larger probability density than all points outside. However, HPD regions are not invariant under reparametrization: the image φ(R) of an HPD region R will be a credible region for φ, but will not generally be HPD; indeed, there is no compelling reason to restrict attention to HPD credible regions. In one-dimensional problems, posterior quantiles are often used to derive credible regions. Thus, if θq = θq(D) is the 100q% posterior quantile of θ, then R = {θ; θ ≤ θq} is a one-sided, typically unique, q-credible region, and it is invariant under reparametrization. Indeed, probability centered q-credible regions of the form R = {θ; θ(1−q)/2 ≤ θ ≤ θ(1+q)/2} are easier to compute, and are often quoted in preference to HPD regions.
EXAMPLE 3, continued. Inference on normal parameters. In the numerical example about the value of the gravitational field described in Figure 3a, the interval [9.788, 9.829] in the unrestricted posterior density of g is an HPD, 95%-credible region for g. Similarly, the interval [9.7803, 9.8322] in Figure 3b is also a 95%-credible region for g, but it is not HPD. ⊳
Decision theory may also be used to select credible regions. Thus, lowest posterior loss (LPL) regions are defined as those where all points in the region have smaller posterior expected loss than all points outside. Using the intrinsic discrepancy as a loss function yields intrinsic credible regions which, as one would expect from an invariant loss function, are coherent under one-to-one transformations. For details, see [Bernardo, 2005b; 2007].
The concept of a credible region for a function θ = θ(ω) of the parameter vector is trivially extended to prediction problems. Thus, a posterior q-credible region for x ∈ X is a subset R of the sample space X with posterior predictive probability q, so that ∫_R p(x|D) dx = q.
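For a unimodal posterior, an HPD interval can be found by searching, among all intervals of posterior mass q, for the shortest one. A sketch (numpy/scipy assumed) contrasting it with the probability centered interval for a skewed Beta posterior:

```python
# Probability centered versus HPD 0.95-credible intervals for Be(theta|3, 9)
# (numpy/scipy assumed).
import numpy as np
from scipy import stats

post, q = stats.beta(3, 9), 0.95
central = post.ppf([(1 - q) / 2, (1 + q) / 2])

# HPD: slide a window of posterior mass q and keep the shortest interval
lo = np.linspace(0, post.ppf(1 - q), 10000)
hi = post.ppf(post.cdf(lo) + q)
i = np.argmin(hi - lo)
print(central)          # centered interval
print(lo[i], hi[i])     # HPD interval (shorter, shifted towards the mode)
```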
5.2 Hypothesis Testing
The reference posterior distribution π(θ|D) of the quantity of interest θ conveys immediate intuitive information on those values of θ which, given the assumed model, may be taken to be compatible with the observed data D, namely, those with a relatively high probability density. Sometimes, a restriction θ ∈ Θ0 ⊂ Θ of the possible values of the quantity of interest (where Θ0 may possibly consist of a single value θ0) is suggested in the course of the investigation as deserving special consideration, either because restricting θ to Θ0 would greatly simplify the model, or because there are additional, context-specific arguments suggesting that θ ∈ Θ0. Intuitively, the hypothesis H0 ≡ {θ ∈ Θ0} should be judged to be compatible with the observed data D if there are elements in Θ0 with a relatively high posterior density. However, a more precise conclusion is often required and, once again, this is made possible by adopting a decision-oriented approach. Formally, testing the hypothesis H0 ≡ {θ ∈ Θ0} is a decision problem where the action space has only two elements, namely to accept (a0) or to reject (a1) the proposed restriction. To solve this decision problem, it is necessary to specify an appropriate loss function, L(ai, θ), measuring the consequences of accepting or rejecting H0 as a function of the actual value θ of the vector of interest. Notice that this requires the statement of an alternative a1 to accepting H0; this is only to be expected, for an action is taken not because it is good, but because it is better than anything else that has been imagined. Given data D, the optimal action will be to reject H0 if (and only if) the expected posterior loss of accepting, ∫_Θ L(a0, θ) π(θ|D) dθ, is larger than the expected posterior loss of rejecting, ∫_Θ L(a1, θ) π(θ|D) dθ, that is, if (and only if)
(54)  ∫_Θ [L(a0, θ) − L(a1, θ)] π(θ|D) dθ = ∫_Θ ∆L(θ) π(θ|D) dθ > 0.
Therefore, only the loss difference ∆L(θ) = L(a0, θ) − L(a1, θ), which measures the advantage of rejecting H0 as a function of θ, has to be specified. Thus, as common sense dictates, the hypothesis H0 should be rejected whenever the expected advantage of rejecting H0 is positive.
A crucial element in the specification of the loss function is a description of what is actually meant by rejecting H0. By assumption, a0 means to act as if H0 were true, i.e. as if θ ∈ Θ0, but there are at least two options for the alternative action a1. This may either mean (i) the negation of H0, that is, to act as if θ ∉ Θ0 or, alternatively, it may rather mean (ii) to reject the simplification implied by H0 and to keep the unrestricted model, θ ∈ Θ, which is true by assumption. Both options have been analyzed in the literature, although it may be argued that the problems of scientific data analysis where hypothesis testing procedures are typically used are better described by the second alternative. Indeed, an established model, identified by H0 ≡ {θ ∈ Θ0}, is often embedded into a more general model, {θ ∈ Θ, Θ0 ⊂ Θ}, constructed to include possibly promising departures from H0, and it is required to verify whether presently available data D are still compatible with θ ∈ Θ0, or whether the extension to θ ∈ Θ is really required.
EXAMPLE 14 Conventional hypothesis testing. Let π(θ|D), θ ∈ Θ, be the posterior distribution of the quantity of interest, let a0 be the decision to work under the restriction θ ∈ Θ0 and let a1 be the decision to work under the complementary restriction θ ∉ Θ0. Suppose, moreover, that the loss structure has the simple, zero-one form given by {L(a0, θ) = 0, L(a1, θ) = 1} if θ ∈ Θ0 and, similarly, {L(a0, θ) = 1, L(a1, θ) = 0} if θ ∉ Θ0, so that the advantage ∆L(θ) of rejecting H0 is 1 if θ ∉ Θ0 and it is −1 otherwise. With this loss function it is immediately found that the optimal action is to reject H0 if (and only if) Pr(θ ∉ Θ0|D) > Pr(θ ∈ Θ0|D). Notice that this formulation requires that Pr(θ ∈ Θ0) > 0, that is, that the hypothesis H0 has a strictly positive prior probability. If θ is a continuous parameter and Θ0 has zero measure (for instance if H0 consists of a single point θ0), this requires the use of a non-regular “sharp” prior concentrating a positive probability mass on Θ0. For details see [Kass and Raftery, 1995] and references therein.
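When Θ0 is an interval with positive posterior mass, the zero-one rule is a one-liner; a sketch (scipy assumed) using the infection posterior of Figure 4 and the interval hypothesis Θ0 = [0, 0.01]:

```python
# Example 14's zero-one rule for an interval hypothesis under the
# Be(theta|0.5, 100.5) reference posterior (scipy assumed).
from scipy.stats import beta

posterior = beta(0.5, 100.5)
p0 = posterior.cdf(0.01)   # Pr(theta in Theta0 | D), with Theta0 = [0, 0.01] ~ 0.844
print('accept H0' if p0 > 1 - p0 else 'reject H0')
```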
EXAMPLE 15 Intrinsic hypothesis testing. Again, let π(θ|D), θ ∈ Θ, be the posterior distribution of the quantity of interest, and let a0 be the decision to work under the restriction θ ∈ Θ0, but let a1 now be the decision to keep the general, unrestricted model θ ∈ Θ. In this case, the advantage ∆L(θ) of rejecting H0 as a function of θ may safely be assumed to have the form ∆L(θ) = δ(Θ0, θ) − δ*, for some δ* > 0, where (i) δ(Θ0, θ) is some measure of the discrepancy between the assumed model p(D|θ) and its closest approximation within the class {p(D|θ0), θ0 ∈ Θ0}, such that δ(Θ0, θ) = 0 whenever θ ∈ Θ0, and (ii) δ* is a context-dependent utility constant which measures the (necessarily positive) advantage of being able to work with the simpler model when it is true. Choices for both δ(Θ0, θ) and δ* which may be appropriate for general use will now be described.
For reasons similar to those supporting its use in point estimation, an attractive choice for the function δ(Θ0, θ) is an appropriate extension of the intrinsic discrepancy; when there are no nuisance parameters, this is given by

(55)  δ(Θ0, θ) = inf_{θ0 ∈ Θ0} min{κ(θ0|θ), κ(θ|θ0)},

where κ(θ0|θ) = ∫_T p(t|θ) log{p(t|θ)/p(t|θ0)} dt, and t = t(D) ∈ T is any sufficient statistic, which may well be the whole data set D. As before, if the data D = {x1, . . . , xn} consist of a random sample from p(x|θ), then

(56)  κ(θ0|θ) = n ∫_X p(x|θ) log[p(x|θ)/p(x|θ0)] dx.

Naturally, the loss function δ(Θ0, θ) reduces to the intrinsic discrepancy δ(θ0, θ) of Example 13 when Θ0 contains a single element θ0. Besides, as in the case of estimation, the definition is easily extended to problems with nuisance parameters, with

(57)  δ(Θ0, θ, λ) = inf_{θ0 ∈ Θ0, λ0 ∈ Λ} δ(θ0, λ0, θ, λ).
The hypothesis H0 should be rejected if the posterior expected advantage of rejecting is positive, that is, if

(58)  d(Θ0|D) = ∫_Λ ∫_Θ δ(Θ0, θ, λ) π(θ, λ|D) dθ dλ > δ*,

for some δ* > 0. As an expectation of a non-negative quantity, d(Θ0, D) is obviously non-negative. Moreover, if φ = φ(θ) is a one-to-one transformation of θ, then d(φ(Θ0), D) = d(Θ0, D), so that, as one should clearly require, the expected intrinsic loss of rejecting H0 is invariant under reparametrization. It may be shown that, as the sample size increases, the expected value of d(Θ0, D) under sampling tends to one when H0 is true, and tends to infinity otherwise; thus d(Θ0, D) may be regarded as a continuous, positive measure of how inappropriate (in loss-of-information units) it would be to simplify the model by accepting H0. In traditional language, d(Θ0, D) is a test statistic for H0, and the hypothesis should be rejected if the value of d(Θ0, D) exceeds some critical value δ*. In sharp contrast to conventional hypothesis testing, this critical value δ* is found to be a context-specific, positive utility constant, which may precisely be described as the number of information units which the decision maker is prepared to lose in order to be able to work with the simpler model H0, and does not depend on the sampling properties of the probability model. The procedure may be used with standard, continuous regular priors even in sharp hypothesis testing, when Θ0 is a zero-measure set (as would be the case if θ is continuous and Θ0 contains a single point θ0).
Naturally, to implement the test, the utility constant δ* which defines the rejection region must be chosen. Values of d(Θ0, D) of about 1 should be regarded as
an indication of no evidence against H0, since this is precisely the expected value of the test statistic d(Θ0, D) under repeated sampling from the null. It follows from its definition that d(Θ0, D) is the reference posterior expectation of the log-likelihood ratio against the null. Hence, values of d(Θ0, D) of about log[12] ≈ 2.5 and log[150] ≈ 5 should respectively be regarded as an indication of mild evidence against H0, and of significant evidence against H0. In the canonical problem of testing a value µ = µ0 for the mean of a normal distribution with known variance (see below), these values correspond to the observed sample mean x̄ respectively lying 2 or 3 posterior standard deviations from the null value µ0. Notice that, in sharp contrast to frequentist hypothesis testing, where it is hazily recommended to adjust the significance level for dimensionality and sample size, this provides an absolute scale (in information units) which remains valid for any sample size and any dimensionality. For further details on intrinsic hypothesis testing see [Bernardo and Rueda, 2002; Bernardo and Pérez, 2007].
EXAMPLE 16 Testing the value of a normal mean. Let the data D = {x1, . . . , xn} be a random sample from a normal distribution N(x|µ, σ), where σ is assumed to be known, and consider the problem of testing whether these data are or are not compatible with some specific sharp hypothesis H0 ≡ {µ = µ0} on the value of the mean. The conventional approach to this problem requires a non-regular prior which places a probability mass, say p0, on the value µ0 to be tested, with the remaining 1 − p0 probability continuously distributed over ℜ. If this prior is chosen to be π(µ|µ ≠ µ0) = N(µ|µ0, σ0), Bayes theorem may be used to obtain the corresponding posterior probability,

(59)  Pr[µ0|D, λ] = B01(D, λ) p0 / [(1 − p0) + p0 B01(D, λ)],

(60)  B01(D, λ) = (1 + n/λ)^{1/2} exp[ −(1/2) (n/(n + λ)) z² ],

where z = (x̄ − µ0)/(σ/√n) measures, in standard deviations, the distance between x̄ and µ0, and λ = σ²/σ0² is the ratio of model to prior variance. The function B01(D, λ), a ratio of (integrated) likelihood functions, is called the Bayes factor in favour of H0. With a conventional zero-one loss function, H0 should be rejected if Pr[µ0|D, λ] < 1/2. The choices p0 = 1/2 and λ = 1 or λ = 1/2, describing particular forms of sharp prior knowledge, have been suggested in the literature for routine use.
The conventional approach to sharp hypothesis testing deals with situations of concentrated prior probability; it assumes important prior knowledge about the value of µ and, hence, should not be used unless this is an appropriate assumption. Moreover [Bartlett, 1957], the resulting posterior probability is extremely sensitive to the specific prior specification. In most applications, H0 is really a hazily defined small region rather than a point. For moderate sample sizes, the posterior probability Pr[µ0|D, λ] is an approximation to the posterior
probability Pr[µ0 − ǫ < µ < µ0 + ǫ|D, λ] for some small interval around µ0 which would have been obtained from a regular, continuous prior heavily concentrated around µ0; however, this approximation always breaks down for sufficiently large sample sizes. One consequence (which is immediately apparent from the last two equations) is that, for any fixed value of the pertinent statistic z, the posterior probability of the null, Pr[µ0|D, λ], tends to one as n → ∞. Far from being specific to this example, this unappealing behaviour of posterior probabilities based on sharp, non-regular priors, generally known as Lindley’s paradox [Lindley, 1957], is always present in the conventional Bayesian approach to sharp hypothesis testing.
The intrinsic approach may be used without assuming any sharp prior knowledge. The intrinsic discrepancy is δ(µ0, µ) = n(µ − µ0)²/(2σ²), a simple transformation of the standardized distance between µ and µ0. The reference prior is uniform and the corresponding (proper) posterior distribution is π(µ|D) = N(µ|x̄, σ/√n). The expected value of δ(µ0, µ) with respect to this posterior is d(µ0, D) = (1 + z²)/2, where z = (x̄ − µ0)/(σ/√n) is the standardized distance between x̄ and µ0. As foretold by the general theory, the expected value of d(µ0, D) under repeated sampling is one if µ = µ0, and increases linearly with n if µ ≠ µ0. Moreover, in this canonical example, to reject H0 whenever |z| > 2 or |z| > 3, that is, whenever µ0 is 2 or 3 posterior standard deviations away from x̄, respectively corresponds to rejecting H0 whenever d(µ0, D) is larger than 2.5, or larger than 5.
If σ is unknown, the reference prior is π(µ, σ) = σ^{-1}, and the intrinsic discrepancy becomes

(61)  δ(µ0, µ, σ) = (n/2) log[ 1 + ((µ − µ0)/σ)² ].

The intrinsic test statistic d(µ0, D) is found as the expected value of δ(µ0, µ, σ) under the corresponding joint reference posterior distribution; this may be exactly expressed in terms of hypergeometric functions, and is well approximated by

(62)  d(µ0, D) ≈ 1/2 + (n/2) log(1 + t²/n),

where t is the traditional statistic t = √(n − 1)(x̄ − µ0)/s, with ns² = Σ_j (xj − x̄)². For instance, for sample sizes 5, 30 and 1000, and using the utility constant δ* = 5, the hypothesis H0 would be rejected whenever |t| is respectively larger than 5.025, 3.240, and 3.007.
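A direct sketch of this test (numpy assumed), using the approximation (62) and the δ* = 5 rule:

```python
# Intrinsic test of H0: mu = mu0, sigma unknown, via equation (62) (numpy assumed).
import numpy as np

def intrinsic_test(x, mu0, dstar=5.0):
    n, xbar = x.size, x.mean()
    s = np.sqrt(((x - xbar) ** 2).sum() / n)      # n s^2 = sum (x_j - xbar)^2
    t = np.sqrt(n - 1) * (xbar - mu0) / s         # traditional t statistic
    d = 0.5 + 0.5 * n * np.log(1 + t ** 2 / n)    # equation (62)
    return d, d > dstar

rng = np.random.default_rng(1)
x = rng.normal(0.7, 1.0, 30)
print(intrinsic_test(x, mu0=0.0))   # d(mu0, D) and whether to reject at delta* = 5
```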
6 DISCUSSION
This article focuses on the basic concepts of the Bayesian paradigm, with special emphasis on the derivation of “objective” methods, where the results only depend on the data obtained and the model assumed. Many technical aspects have been spared; the interested reader is referred to the bibliography for further information. This final section briefly reviews the main arguments for an objective Bayesian approach.
6.1 Coherence
By using probability distributions to characterize all uncertainties in the problem, the Bayesian paradigm reduces statistical inference to applied probability, thereby ensuring the coherence of the proposed solutions. There is no need to investigate, on a case by case basis, whether or not the solution to a particular problem is logically correct: a Bayesian result is only a mathematical consequence of explicitly stated assumptions and hence, unless a logical mistake has been committed in its derivation, it cannot be formally wrong. In marked contrast, conventional statistical methods are plagued with counterexamples. These include, among many others, negative estimators of positive quantities, q-confidence regions (q < 1) which consist of the whole parameter space, empty sets of “appropriate” solutions, and incompatible answers from alternative methodologies simultaneously supported by the theory. The Bayesian approach does require, however, the specification of a (prior) probability distribution over the parameter space. The sentence “a prior distribution does not exist for this problem” is often stated to justify the use of non-Bayesian methods. However, the general representation theorem proves the existence of such a distribution whenever the observations are assumed to be exchangeable (and, if they are assumed to be a random sample then, a fortiori, they are assumed to be exchangeable). To ignore this fact, and to proceed as if a prior distribution did not exist, just because it is not easy to specify, is mathematically untenable.
6.2 Objectivity
It is generally accepted that any statistical analysis is subjective, in the sense that it is always conditional on accepted assumptions (on the structure of the data, on the probability model, and on the outcome space) and those assumptions, although possibly well founded, are definitely subjective choices. It is, therefore, mandatory to make all assumptions very explicit. Users of conventional statistical methods rarely dispute the mathematical foundations of the Bayesian approach, but claim to be able to produce “objective” answers in contrast to the possibly subjective elements involved in the choice of the prior distribution. Bayesian methods do indeed require the choice of a prior distribution, and critics of the Bayesian approach systematically point out that in many important situations, including scientific reporting and public decision making, the results must exclusively depend on documented data which might be subject to independent scrutiny. This is of course true, but those critics choose to ignore the fact that this particular case is covered within the Bayesian approach by the use of reference prior distributions which (i) are mathematically derived from the accepted probability model (and, hence, they are “objective” insofar as the choice of that model might be objective) and, (ii) by construction, they produce posterior probability distributions which, given the accepted probability model, only contain the information about their values which data may provide and, optionally, any further
contextual information over which there might be universal agreement.
6.3 Operational meaning
An issue related to objectivity is that of the operational meaning of reference posterior probabilities; it is found that the analysis of their behaviour under repeated sampling provides a suggestive form of calibration. Indeed, Pr[θ ∈ R | D] = ∫_R π(θ | D) dθ, the reference posterior probability that θ ∈ R, is both a measure of the conditional uncertainty (given the assumed model and the observed data D) about the event that the unknown value of θ belongs to R ⊂ Θ, and the limiting proportion of the regions which would cover θ under repeated sampling conditional on data “sufficiently similar” to D. Under broad conditions (to guarantee regular asymptotic behaviour), all large data sets from the same model are “sufficiently similar” among themselves in this sense and hence, given those conditions, reference posterior credible regions are approximate frequentist confidence regions. The conditions for this approximate equivalence to hold exclude, however, important special cases, like those involving “extreme” or “relevant” observations. In very special situations, when probability models may be transformed to location-scale models, there is an exact equivalence; in those cases reference posterior credible intervals are, for any sample size, exact frequentist confidence intervals.
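To make this calibration property concrete, the following minimal simulation (our illustrative sketch, not part of the original text) checks the exact location-scale case numerically: for a normal mean with known variance under the reference (uniform) prior, central 0.95 reference posterior credible intervals cover the true parameter value in about 95% of repeated samples.

```python
# Minimal simulation sketch (not from the article): empirical check that,
# for a normal mean with known variance and the reference (uniform) prior,
# 0.95 reference posterior credible intervals cover the true mean in about
# 95% of repeated samples -- the calibration property described above.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, n, reps = 2.0, 1.0, 25, 10_000
z = 1.959964  # standard normal 0.975 quantile

covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, size=n)
    # Reference posterior for mu is Normal(mean(x), sigma^2 / n); its central
    # 0.95 credible interval is numerically the classical confidence interval.
    half_width = z * sigma / np.sqrt(n)
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= mu_true <= hi)

print(f"empirical coverage: {covered / reps:.3f}")  # close to 0.950
```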
6.4 Generality
In sharp contrast to most conventional statistical methods, which may only be exactly applied to a handful of relatively simple stylized situations, Bayesian methods are defined to be totally general. Indeed, for a given probability model and prior distribution over its parameters, the derivation of posterior distributions is a well-defined mathematical exercise. In particular, Bayesian methods do not require any particular regularity conditions on the probability model, do not depend on the existence of sufficient statistics of finite dimension, do not rely on asymptotic relations, and do not require the derivation of any sampling distribution, nor (a fortiori) the existence of a “pivotal” statistic whose sampling distribution is independent of the parameters.

However, when used in complex models with many parameters, Bayesian methods often require the computation of multidimensional definite integrals and, for a long time, this requirement effectively placed practical limits on the complexity of the problems which could be handled. This has dramatically changed in recent years with the general availability of large computing power, and the parallel development of simulation-based numerical integration techniques like importance sampling or Markov chain Monte Carlo (MCMC). These methods provide a structure within which many complex models may be analyzed using generic software. MCMC is numerical integration using Markov chains: Monte Carlo integration proceeds by drawing samples from the required distributions and computing sample averages to approximate expectations, and MCMC methods draw the required samples by running appropriately defined Markov chains for a long time. Specific methods to construct those chains include the Gibbs sampler and the Metropolis algorithm, which originated in the 1950s in the statistical physics literature.
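As an illustration of the idea (a minimal sketch of our own, not code from the article), the following random-walk Metropolis sampler targets a deliberately simple unnormalized posterior; in practice the same scheme is applied to far more complex multi-parameter posteriors.

```python
# A minimal random-walk Metropolis sketch (illustrative; not code from the
# article). The target is a hypothetical unnormalized log-posterior, chosen
# to be a standard normal for simplicity; sample averages over the chain
# approximate posterior expectations.
import math
import random

def log_target(theta):
    # Hypothetical unnormalized log-posterior (standard normal).
    return -0.5 * theta * theta

def metropolis(n_iter=50_000, step=1.0, theta=0.0):
    chain = []
    for _ in range(n_iter):
        proposal = theta + random.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if math.log(random.random()) < log_target(proposal) - log_target(theta):
            theta = proposal
        chain.append(theta)
    return chain

chain = metropolis()
burn = chain[10_000:]  # discard burn-in before computing averages
print(sum(burn) / len(burn))                 # approximates E[theta], near 0
print(sum(t * t for t in burn) / len(burn))  # approximates E[theta^2], near 1
```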
The development of improved algorithms and of appropriate diagnostic tools to establish their convergence remains a very active research area. For an introduction to MCMC methods in Bayesian inference, see [Gilks et al., 1996; Mira, 2005] and references therein.

BIBLIOGRAPHY

[Bartlett, 1957] M. Bartlett. A comment on D. V. Lindley’s statistical paradox. 44, 533–534, 1957.
[Berger and Bernardo, 1989] J. O. Berger and J. M. Bernardo. Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc., 84, 200–207, 1989.
[Berger and Bernardo, 1992a] J. O. Berger and J. M. Bernardo. Ordered group reference priors with applications to a multinomial problem. 79, 25–37, 1992.
[Berger and Bernardo, 1992b] J. O. Berger and J. M. Bernardo. Reference priors in a variance components problem. Bayesian Analysis in Statistics and Econometrics, 323–340, 1992.
[Berger and Bernardo, 1992c] J. O. Berger and J. M. Bernardo. On the development of reference priors. 4, 35–60, 1992 (with discussion).
[Berger et al., 2009] J. O. Berger, J. M. Bernardo, and D. Sun. The formal definition of reference priors. Ann. Statist., 37, 905–938, 2009.
[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Berlin: Springer, 1985.
[Bernardo, 1979a] J. M. Bernardo. Expected information as expected utility. Ann. Statist., 7, 686–690, 1979.
[Bernardo, 1979b] J. M. Bernardo. Reference posterior distributions for Bayesian inference. J. R. Statist. Soc. B, 41, 113–147, 1979 (with discussion). Reprinted in Bayesian Inference 1 (G. C. Tiao and N. G. Polson, eds). Oxford: Edward Elgar, 229–263.
[Bernardo, 1981] J. M. Bernardo. Reference decisions. Symposia Mathematica, 25, 85–94, 1981.
[Bernardo, 1997] J. M. Bernardo. Noninformative priors do not exist. J. Statist. Planning and Inference, 65, 159–189, 1997 (with discussion).
[Bernardo, 2005a] J. M. Bernardo. Reference analysis. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds). Amsterdam: Elsevier, 17–90, 2005.
[Bernardo, 2005b] J. M. Bernardo. Intrinsic credible regions: An objective Bayesian approach to interval estimation. Test, 14, 317–384, 2005 (with discussion).
[Bernardo, 2006] J. M. Bernardo. Intrinsic point estimation of the normal variance. Bayesian Statistics and its Applications (S. K. Upadhyay, U. Singh and D. K. Dey, eds). New Delhi: Anamaya Pub, 110–121, 2006.
[Bernardo, 2007] J. M. Bernardo. Objective Bayesian point and region estimation in location-scale models. Sort, 14, 3–44, 2007.
[Bernardo and Juárez, 2003] J. M. Bernardo and M. A. Juárez. Intrinsic estimation. 7, 465–476, 2003.
[Bernardo and Pérez, 2007] J. M. Bernardo and S. Pérez. Comparing normal means: New methods for an old problem. Bayesian Analysis, 2, 45–58, 2007.
[Bernardo and Ramón, 1998] J. M. Bernardo and J. M. Ramón. An introduction to Bayesian reference analysis: inference on the ratio of multinomial parameters. The Statistician, 47, 1–35, 1998.
[Bernardo and Rueda, 2002] J. M. Bernardo and R. Rueda. Bayesian hypothesis testing: A reference approach. International Statistical Review, 70, 351–372, 2002.
[Bernardo and Smith, 1994] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Chichester: Wiley, 1994. Second edition forthcoming.
[Box and Tiao, 1973] G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973.
[Datta and Sweeting, 2005] G. S. Datta and T. J. Sweeting. Probability matching priors. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds). Amsterdam: Elsevier, 91–114, 2005.
[Dawid et al., 1973] A. P. Dawid, M. Stone, and J. V. Zidek. Marginalization paradoxes in Bayesian and structural inference. J. R. Statist. Soc. B, 35, 189–233, 1973 (with discussion).
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincaré, 7, 1–68, 1937. Reprinted in 1980 as ‘Foresight; its logical laws, its subjective sources’ in Studies in Subjective Probability, 93–158.
[de Finetti, 1970] B. de Finetti. Teoria delle Probabilità. Turin: Einaudi, 1970. English translation as Theory of Probability. Chichester: Wiley, 1975.
[DeGroot, 1970] M. H. DeGroot. Optimal Statistical Decisions. New York: McGraw-Hill, 1970.
[Efron, 1986] B. Efron. Why isn’t everyone a Bayesian? Amer. Statist., 40, 1–11, 1986 (with discussion).
[Geisser, 1993] S. Geisser. Predictive Inference: an Introduction. London: Chapman and Hall, 1993.
[Gilks et al., 1996] W. R. Gilks, S. Y. Richardson, and D. J. Spiegelhalter, eds. Markov Chain Monte Carlo in Practice. London: Chapman and Hall, 1996.
[Jaynes, 1976] E. T. Jaynes. Confidence intervals vs. Bayesian intervals. Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science 2 (W. L. Harper and C. A. Hooker, eds). Dordrecht: Reidel, 175–257, 1976 (with discussion).
[Jeffreys, 1939] H. Jeffreys. Theory of Probability. Oxford: Oxford University Press, 1939. Third edition in 1961, Oxford: Oxford University Press.
[Kass and Raftery, 1995] R. E. Kass and A. E. Raftery. Bayes factors. J. Amer. Statist. Assoc., 90, 773–795, 1995.
[Kass and Wasserman, 1996] R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. J. Amer. Statist. Assoc., 91, 1343–1370, 1996.
[Laplace, 1812] P. S. Laplace. Théorie Analytique des Probabilités. Paris: Courcier, 1812. Reprinted as Oeuvres Complètes de Laplace 7, 1878–1912. Paris: Gauthier-Villars.
[Lindley, 1957] D. V. Lindley. A statistical paradox. 44, 187–192, 1957.
[Lindley, 1958] D. V. Lindley. Fiducial distribution and Bayes’ theorem. J. Roy. Statist. Soc., 20, 102–107, 1958.
[Lindley, 1965] D. V. Lindley. Introduction to Probability and Statistics from a Bayesian Viewpoint. Cambridge: Cambridge University Press, 1965.
[Lindley, 1972] D. V. Lindley. Bayesian Statistics, a Review. Philadelphia, PA: SIAM, 1972.
[Liseo, 2005] B. Liseo. The elimination of nuisance parameters. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds). Amsterdam: Elsevier, 193–219, 2005.
[Mira, 2005] A. Mira. MCMC methods to estimate Bayesian parametric models. Handbook of Statistics 25 (D. K. Dey and C. R. Rao, eds). Amsterdam: Elsevier, 415–436, 2005.
[Ramsey, 1926] F. P. Ramsey. Truth and probability. The Foundations of Mathematics and Other Logical Essays (R. B. Braithwaite, ed). London: Kegan Paul, 1926 (1931), 156–198. Reprinted in 1980 in Studies in Subjective Probability, 61–92.
[Savage, 1954] L. J. Savage. The Foundations of Statistics. New York: Wiley, 1954. Second edition in 1972, New York: Dover.
[Stein, 1959] C. Stein. An example of wide discrepancy between fiducial and confidence intervals. Ann. Math. Statist., 30, 877–880, 1959.
[Zellner, 1971] A. Zellner. An Introduction to Bayesian Inference in Econometrics. New York: Wiley, 1971. Reprinted in 1987, Melbourne, FL: Kreiger.
EVIDENTIAL PROBABILITY AND OBJECTIVE BAYESIAN EPISTEMOLOGY

Gregory Wheeler and Jon Williamson
1 INTRODUCTION
Evidential probability (EP), developed by Henry Kyburg, offers an account of the impact of statistical evidence on single-case probability. According to this theory, observed frequencies of repeatable outcomes determine a probability interval that can be associated with a proposition. After giving a comprehensive introduction to EP in §2, in §3 we describe a recent variant of this approach, second-order evidential probability (2oEP). This variant, introduced in [Haenni et al., 2010], interprets a probability interval of EP as bounds on the sharp probability of the corresponding proposition. In turn, this sharp probability can itself be interpreted as the degree to which one ought to believe the proposition in question. At this stage we introduce objective Bayesian epistemology (OBE), a theory of how evidence helps determine appropriate degrees of belief (§4). OBE might be thought of as a rival to the evidential probability approaches. However, we show in §5 that they can be viewed as complementary: one can use the rules of EP to narrow down the degree to which one should believe a proposition to an interval, and then use the rules of OBE to help determine an appropriate degree of belief from within this interval. Hence bridges can be built between evidential probability and objective Bayesian epistemology.
2 EVIDENTIAL PROBABILITY

2.1 Motivation
Rudolf Carnap [Carnap, 1962] drew a distinction between probability1, which concerned rational degrees of belief, and probability2, which concerned statistical regularities. Although he claimed that both notions of probability were crucial to scientific inference, Carnap practically ignored probability2 in the development of his systems of inductive logic. Evidential probability (EP) [Kyburg, 1961; Kyburg and Teng, 2001], by contrast, is a theory that gives primacy to probability2, and Kyburg’s philosophical program was an uncompromising attempt to see how far one could go with relative frequencies. Whereas Bayesianism springs from the view
that probability1 is all the probability needed for scientific inference, EP arose from the view that probability2 is all that we really have. The theory of evidential probability is motivated by two basic ideas: probability assessments should be based upon relative frequencies, to the extent that we know them, and the assignment of probability to specific individuals should be determined by everything that is known about that individual.

Evidential probability is conditional probability in the sense that the probability of a sentence χ is evaluated given a set of sentences Γδ. But the evidential probability of χ given Γδ, written Prob(χ, Γδ), is a meta-linguistic operation similar in kind to the relation of provability within deductive systems. The semantics governing the operator Prob(·, ·) is markedly dissimilar to axiomatic theories of probability that take conditional probability as primitive, such as the system developed by Lester Dubins [Dubins, 1975; Arló-Costa and Parikh, 2005], and it also resists reduction to linear previsions [de Finetti, 1974] as well as to lower previsions [Walley, 1991]. One difference between EP and the first two theories is that EP is interval-valued rather than point-valued, because the relative frequencies that underpin the assignment of evidential probability are typically incomplete and approximate. But more generally, EP assignments may violate coherence. For example, suppose that χ and ϕ are sentences in the object language of evidential probability. The evidential probability of χ ∧ ϕ given Γδ might fail to be less than or equal to the evidential probability of χ given Γδ.¹ A point to stress from the start is that evidential probability is a logic of statistical probability statements, and there is nothing in the activity of observing and recording statistical regularities that guarantees that a set of statistical probability statements will comport with the axioms of probability. So, EP is neither a species of Carnapian logical probability nor a kind of Bayesian probabilistic logic.²,³ EP is instead a logic for approximate reasoning; it is thus more similar in kind to the theory of rough sets [Pawlak, 1991] and to systems of fuzzy logic [Dubois and Prade, 1980] than to probabilistic logic.

The operator Prob(·, ·) takes as arguments a sentence χ in the first coordinate and a set of statements Γδ in the second. The statements in Γδ represent a knowledge base, which includes categorical statements as well as statistical generalities. Theorems of logic and mathematics are examples of categorical statements, but so too are contingent generalities. One example of a contingent categorical statement is the ideal gas law.

1. Specifically, the lower bound of Prob(χ ∧ ϕ, Γδ) may be strictly greater than the lower bound of Prob(χ, Γδ).
2. See the essays by Levi and by Seidenfeld in [Harper and Wheeler, 2007] for a discussion of the sharp differences between EP and Bayesian approaches, particularly on the issue of conditionalization. A point sometimes overlooked by critics is that there are different systems of evidential probability corresponding to different conditions we assume to hold. Results pertaining to a qualitative representation of EP inference, for instance, assume that Γδ is consistent. A version of conditionalization holds in EP given that there is a specific statistical statement pertaining to the relevant joint distribution. See [Kyburg, 2007] and [Teng, 2007].
3. EP does inherit some notions from Keynes [Keynes, 1921], however, including that probabilities are interval-valued and not necessarily comparable.

EP views the propositions “2 + 2 = 4” and “PV = nRT”
within a chemistry knowledge base as indistinguishable analytic truths that are built into a particular language adopted for handling statistical statements to do with gases. In light of EP’s expansive view of analyticity, the theory represents all categorical statements as universally quantified sentences within a guarded fragment of first-order logic [Andréka et al., 1998].⁴

Statistical generalities within Γδ, by contrast, are viewed as direct inference statements and are represented by syntax that is unique to evidential probability. Direct inference, recall, is the assignment of probability to a target subclass given known frequency information about a reference population, and is often contrasted with indirect inference, which is the assignment of probability to a population given observed frequencies in a sample. Kyburg’s ingenious idea was to solve the problem of indirect inference by viewing it as a form of direct inference. Since the philosophical problems concerning direct inference are much less contentious than those raised by indirect inference, the unusual properties and behavior of evidential probability should be weighed against this achievement [Levi, 2007].

Direct inference statements are statements that record the observed frequency of items satisfying a specified reference class that also satisfy a particular target class, and take the form of

%~x(τ(~x), ρ(~x), [l, u]).

This schematic statement says that, given a sequence of propositional variables ~x that satisfies the reference class predicate ρ, the proportion of ρ that also satisfies the target class predicate τ is between l and u. Syntactically, ‘τ(~x), ρ(~x), [l, u]’ is an open formula schema, where ‘τ(·)’ and ‘ρ(·)’ are replaced by open first-order formulas, ‘~x’ is replaced by a sequence of propositional variables, and ‘[l, u]’ is replaced by a specific sub-interval of [0, 1]. The binding operator ‘%’ is similar to the ordinary binding operators (∀, ∃) of first-order logic, except that ‘%’ is a 3-place binding operator over the propositional variables appearing in the target formula τ(~x) and the reference formula ρ(~x), binding those formulas to an interval.⁵

The language Lep of evidential probability then is a guarded first-order language augmented to include direct inference statements. There are additional formation rules for direct inference statements that are designed to block spurious inference, but we shall pass over these details of the theory.⁶ An example of a direct inference statement that might appear in Γδ is %x(B(x), A(x), [.71, .83]), which expresses that the proportion of As that are also Bs lies between 0.71 and 0.83.

4. A guarded fragment of first-order logic is a decidable fragment of first-order logic.
5. Hereafter we relax notation and simply use an arbitrary variable ‘x’ for ‘~x’.
6. See [Kyburg and Teng, 2001].

As for semantics, a model M of Lep is a pair ⟨D, I⟩, where D is a two-sorted domain consisting of mathematical objects, Dm, and a finite set of empirical objects, De. (EP assumes that there is a first giraffe and a last carbon molecule.) I is
an interpretation function that is the union of two partial functions, one defined on Dm and the other on De. Otherwise M behaves like a first-order model: the interpretation function I maps (empirical/mathematical) terms into the (empirical/mathematical) elements of D, monadic predicates into subsets of D, n-ary relation symbols into subsets of Dⁿ, and so forth. Variable assignments also behave as one would expect, the only difference being the procedure for assigning truth to direct inference statements.

The basic idea behind the semantics for direct inference statements is that the statistical quantifier ‘%’ ranges over the finite empirical domain De, not over the field terms l, u that denote real numbers in Dm. This means that the only free variables in a direct inference statement range over a finite domain, which will allow us to look at proportions of models in which a sentence is true. A satisfaction set of an open formula ϕ whose only free variables are empirical is the subset of Dⁿ that satisfies ϕ. A direct inference statement %x(τ(x), ρ(x), [l, u]) is true in M under variable assignment v iff the cardinality of the satisfaction set for the open formula ρ under v is greater than 0, and the ratio of the cardinality of the satisfaction set for τ(x*) ∧ ρ(x*) to the cardinality of the satisfaction set for ρ(x) (under v) is in the closed interval [l, u], where all variables of x occur in ρ, all variables of τ occur in ρ, and x* is the sequence of variables free in ρ but not bound by %x [Kyburg and Teng, 2001].

The operator Prob(·, ·) then provides a semantics for a nonmonotonic consequence operator [Wheeler, 2004; Kyburg et al., 2007]. The structural properties enjoyed by this consequence operator are as follows:⁷

Properties of EP Entailment: Let |= denote classical consequence and let ≡ denote classical logical equivalence. Whenever µ ∧ ξ, ν ∧ ξ are sentences of Lep:

Right Weakening: if µ |≈ ν and ν |= ξ then µ |≈ ξ.
Left Classical Equivalence: if µ |≈ ν and µ ≡ ξ then ξ |≈ ν.
(KTW) Cautious Monotony: if µ |= ν and µ |≈ ξ then µ ∧ ξ |≈ ν.
(KTW) Premise Disjunction: if µ |= ν and ξ |≈ ν then µ ∨ ξ |≈ ν.
(KTW) Conclusion Conjunction: if µ |= ν and µ |≈ ξ then µ |≈ ν ∧ ξ.

7. Note that these properties are similar to, but strictly weaker than, the properties of the class of cumulative consequence relations specified by System P [Kraus et al., 1990]. To yield the axioms of System P, replace |= by the nonmonotonic consequence operator |∼ in the premise position of [And*], [Or*], and [Cautious Monotonicity*].

As an aside, this qualitative EP-entailment relation presents challenges in handling disjunction in the premises, since the KTW disjunction property admits a novel reversal effect similar to, but distinct from, Simpson’s paradox [Kyburg et al., 2007; Wheeler, 2007]. This raises a question over how best to axiomatize EP. One approach, which is followed by [Hawthorne and Makinson, 2007] and considered in
[Kyburg et al., 2007], is to replace Boolean disjunction by ‘exclusive-or’. While this route ensures nice properties for |≈, it does so at the expense of introducing a dubious connective into the object language that is neither associative nor compositional.⁸ Another approach explored in [Kyburg et al., 2007] is a weakened disjunction axiom (KTW Or) that yields a sub-System P nonmonotonic logic and preserves the standard properties of the positive Boolean connectives.

8. Example: ‘A xor B xor C’ is true if A, B, and C all are; and ‘(A xor B) xor C’ is not equivalent to ‘A xor (B xor C)’ when A is false but B and C are both true.

Now that we have a picture of what EP is, we turn to consider the inferential behavior of the theory. We propose to do this with a simple ball-draw experiment before considering the specifics of the theory in more detail in the next section.

EXAMPLE 1. Suppose the proportion of white balls (W) in an urn (U) is known to be within [.33, .4], and that ball t is drawn from U. These facts are represented in Γδ by the sentences %x(W(x), U(x), [.33, .4]) and U(t).

(i) If these two statements are all that we know about t, i.e., they are the only statements in Γδ pertaining to t, then Prob(W(t), Γδ) = [.33, .4].

(ii) Suppose additionally that the proportion of plastic balls (P) that are white is observed to be between [.31, .36], that t is plastic, and that every plastic ball is a ball in U. That means that %x(W(x), P(x), [.31, .36]), P(t), and ∀x.P(x) → U(x) are added to Γδ as well. Then there is conflicting statistical knowledge about t, since either:

1. the probability that ball t is white is between [.33, .4], by reason of %x(W(x), U(x), [.33, .4]), or
2. the probability that ball t is white is between [.31, .36], by reason of %x(W(x), P(x), [.31, .36]),

may apply. There are several ways that statistical statements may conflict, and there are rules for handling each type, which we will discuss in the next section. But in this particular case, because it is known that the class of plastic balls is more specific than the class of balls in U, and we have statistics for the proportion of plastic balls that are also white balls, the statistical statement in (2) dominates the statement in (1). So, the probability that t is white is in [.31, .36].

(iii) Adapting an example from [Kyburg and Teng, 2001, 216], suppose U is partitioned into three cells, u1, u2, and u3, and that the following compound experiment is performed. First, a cell of U is selected at random. Then a ball is drawn at random from that cell. To simplify matters, suppose that there are 25 balls in U and 9 are white, such that 3 of 5 balls from u1 are white, but only 3 of 10 balls in u2 and 3 of 10 in u3 are white. The following table summarizes this information.
Table 1. Compound Experiment

        u1    u2    u3
W        3     3     3     9
¬W       2     7     7    16
         5    10    10    25
We are interested in the probability that t is white, but we have a conflict. Given these overall precise values, we would have Prob(W(t), Γδ) = 9/25. However, since we know that t was selected by performing this compound experiment, we also have the conflicting direct inference statement %x, y(W*(x, y), U*(x, y), [.4, .4]), where U* is the set of compound two-stage experiments and W* is the set of outcomes in which the ball selected is white.⁹ We should prefer the statistics from the compound experiment because they are richer in information. So, the probability that t is white is .4.

(iv) Finally, if there happens to be no statistical knowledge in Γδ pertaining to t, then we would be completely ignorant of the probability that t is white. So in the case of total ignorance, Prob(W(t), Γδ) = [0, 1].

9. Γδ should also include the categorical statement ∀x, y(U*⟨x, y⟩ → W(y)), which says that the second stage of U concerns the proportion of balls that are white, and three statements of the form ⌜W*(µ, t) ↔ W(t)⌝, where µ is replaced by u1, u2, u3, respectively. These statements tell us that everything that is true of W* is true of W, which is what ensures that this conflict is detected.
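The arithmetic behind (iii) is easy to check mechanically. The following few lines (our illustration, not part of the chapter) recompute both the pooled marginal value 9/25 and the compound-experiment value .4 from Table 1.

```python
# A quick check of the Example 1 arithmetic (a sketch, not chapter code).
# Table 1 gives, per cell, the number of white balls and the cell size.
cells = {"u1": (3, 5), "u2": (3, 10), "u3": (3, 10)}

# Marginal statistics: ignore the two-stage structure and pool the urn.
white = sum(w for w, _ in cells.values())
total = sum(n for _, n in cells.values())
print(white / total)  # 9/25 = 0.36

# Compound experiment: pick a cell at random, then a ball at random from it.
p_white = sum(w / n for w, n in cells.values()) / len(cells)
print(p_white)  # (3/5 + 3/10 + 3/10) / 3 = 0.4, matching [.4, .4]
```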
We now turn to a more detailed account of how EP calculates probabilities.

2.2 Calculating Evidential Probability
In practice an individual may belong to several reference classes with known statistics. Selecting the appropriate statistical distribution among the class of potential probability statements is the problem of the reference class. The task of assigning evidential probability to a statement χ relative to a set of evidential certainties relies upon a procedure for eliminating excess candidates from the set of potential candidates. This procedure is described in terms of the following definitions.

Potential Probability Statement: A potential probability statement for χ with respect to Γδ is a tuple ⟨t, τ(t), ρ(t), [l, u]⟩, such that instances of χ ↔ τ(t), ρ(t), and %x(τ(x), ρ(x), [l, u]) are each in Γδ.

Given χ, there are possibly many target statements of form τ(t) in Γδ that have the same truth value as χ. If it is known that individual t satisfies ρ, and known that between .7 and .8 of ρ’s are also τ’s, then ⟨t, τ(t), ρ(t), [.7, .8]⟩ represents a potential probability statement for χ based on the knowledge base Γδ. Our focus
will be on the statistical statements %x(τ(x), ρ(x), [l, u]) in Γδ that are the basis for each potential probability statement. Selecting the appropriate probability interval for χ from the set of potential probability statements reduces to identifying and resolving conflicts among the statistical statements that are the basis for each potential probability statement.

Conflict: Two intervals [l, u] and [l′, u′] conflict iff neither [l, u] ⊂ [l′, u′] nor [l, u] ⊃ [l′, u′]. Two statistical statements conflict iff their intervals conflict. Note that conflicting intervals may be disjoint or intersect. For technical reasons an interval is said to conflict with itself.

Cover: Let X be a set of intervals. An interval [l, u] covers X iff for every [l′, u′] ∈ X, l ≤ l′ and u′ ≤ u. A cover [l, u] of X is the smallest cover, Cov(X), iff for all covers [l*, u*] of X, l* ≤ l and u ≤ u*.

Difference Set: (i) Let X be a non-empty set of intervals and P(X) be the powerset of X. A non-empty Y ∈ P(X) is a difference set of X iff Y includes every x ∈ X that conflicts with some y ∈ Y. (ii) Let X be the set of intervals associated with a set Γ of statistical statements, and Y be the set of intervals associated with a set Λ of statistical statements. Λ is a difference set to Γ iff Y is closed under difference with respect to X.

EXAMPLE 2. An example might help. Let X be the set of intervals [.30, .40], [.35, .45], [.325, .475], [.50, .55], [.30, .70], [.20, .60], [.10, .90]. There are three sets closed under difference with respect to X:

(i) {[.30, .40], [.35, .45], [.325, .475], [.50, .55]},
(ii) {[.30, .70], [.20, .60]},
(iii) {[.10, .90]}.

The intuitive idea behind a difference set is to eliminate intervals from a set that are broad enough to include all other intervals in that set. The interval [.10, .90] is the broadest interval in X. So, it only appears as a singleton difference set and is not included in any other difference set of X. It is not necessary that all intervals in a difference set X be pairwise conflicting intervals. Difference sets identify the set of all possible conflicts for each potential probability statement in order to find the conflicting set with the shortest cover.

Minimal Cover Under Difference: (i) Let X be a non-empty set of intervals and Y = {Y1, . . . , Yn} the set of all difference sets of X. The minimal cover under difference of X is the smallest cover of the elements of Y, i.e., the shortest cover in {Cov(Y1), . . . , Cov(Yn)}. (ii) Let X be the set of intervals associated with a set Γ of statistical statements, and Y be the set of all difference sets of X associated with a set Λ of statistical statements. Then the minimal cover under difference of Γ is the minimal cover under difference of X.

EP resolves conflicting statistical data concerning χ by applying two principles to the set of potential probability assignments, Richness and Specificity, to yield
a class of relevant statements. The (controversial) principle of Strength is then applied to this set of relevant statistical statements, yielding a unique probability interval for χ. For discussion of these principles, see [Teng, 2007]. We illustrate these principles in terms of a pair (ϕ, ϑ) of conflicting statistical statements for χ, and represent their respective reference formulas by ρϕ and ρϑ. The probability interval assigned to χ is the shortest cover of the relevant statistics remaining after applying these principles.

1. [Richness] If ϕ and ϑ conflict and ϑ is based on a marginal distribution while ϕ is based on the full joint distribution, eliminate ϑ.

2. [Specificity] If ϕ and ϑ both survive the principle of richness, and if ρϕ ⊂ ρϑ, then eliminate ⟨τ, ρϑ, [l, u]⟩ from all difference sets. The principle of specificity says that if it is known that the reference class ρϕ is included in the reference class ρϑ, then eliminate the statement ϑ.

The statistical statements that survive the sequential application of the principle of richness followed by the principle of specificity are called relevant statistics.

3. [Strength] Let ΓRS be the set of relevant statistical statements for χ with respect to Γδ, and let {Λ1, . . . , Λn} be the set of difference sets of ΓRS. The principle of strength is the choosing of the minimal cover under difference of ΓRS, i.e., the selection of the shortest cover in {Cov(Λ1), . . . , Cov(Λn)}.

The evidential probability of χ is the minimal cover under difference of ΓRS. We may define Γε, the set of practical certainties, in terms of a body of evidence Γδ:

Γε = {χ : ∃l, u (Prob(¬χ, Γδ) = [l, u] ∧ u ≤ ε)},

or alternatively,

Γε = {χ : ∃l, u (Prob(χ, Γδ) = [l, u] ∧ l ≥ 1 − ε)}.

The set Γε is the set of statements that the evidence Γδ warrants accepting; we say a sentence χ is ε-accepted if χ ∈ Γε. Thus we may add to our knowledge base statements that are nonmonotonic consequences of Γδ with respect to a threshold point of acceptance.

Finally, we may view the evidence Γδ as providing real-valued bounds on ‘degrees of belief’ owing to the logical structure of sentences accepted into Γδ. However, the probability interval [l, u] associated with χ does not specify a range of equally rational degrees of belief between l and u: the interval [l, u] itself is not a quantity; only l and u are quantities, which are used to specify bounds. On this view, no degree of belief within [l, u] is defensible, which is in marked contrast to the view offered by Objective Bayesianism.
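The interval operations underlying these principles are straightforward to mechanise. The following sketch (ours, not the authors') implements conflict, smallest covers, and the principle of strength for the three difference sets listed in Example 2.

```python
# A sketch (not chapter code) of the interval operations behind the
# principle of strength, using the three difference sets of Example 2.

def conflict(a, b):
    """Intervals conflict iff neither strictly contains the other,
    so an interval conflicts with itself, as stipulated above."""
    def proper_subset(p, q):
        return q[0] <= p[0] and p[1] <= q[1] and p != q
    return not (proper_subset(a, b) or proper_subset(b, a))

def cov(intervals):
    """Smallest cover of a set of intervals."""
    return (min(l for l, _ in intervals), max(u for _, u in intervals))

print(conflict((.30, .40), (.35, .45)))    # True: overlap, neither contains
print(conflict((.35, .45), (.325, .475)))  # False: first is contained in second

# The three sets closed under difference from Example 2:
diff_sets = [
    [(.30, .40), (.35, .45), (.325, .475), (.50, .55)],  # (i)
    [(.30, .70), (.20, .60)],                            # (ii)
    [(.10, .90)],                                        # (iii)
]

print([cov(y) for y in diff_sets])
# [(0.3, 0.55), (0.2, 0.7), (0.1, 0.9)]

# Principle of strength: the shortest of these covers.
print(min((cov(y) for y in diff_sets), key=lambda c: c[1] - c[0]))
# (0.3, 0.55)
```

Running the sketch confirms that the cover of difference set (i) is the shortest, so [.30, .55] is the interval the principle of strength would select for this set.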
3 SECOND-ORDER EVIDENTIAL PROBABILITY

3.1 Motivation
Second-order evidential probability—developed in [Haenni et al., 2010]—goes beyond Kyburg’s evidential probability in two ways. First, it treats an EP interval as bounds on sharp probability. Second, it disentangles reasoning under uncertainty from questions of acceptance and rejection. Here we explain both moves in more detail.

3.1.0.1 Bounds on Degrees of Belief. Kyburg maintained that one can interpret an evidential probability interval for proposition χ as providing bounds on the degree to which an agent should believe χ, but he had reservations about this move:

    Should we speak of partial beliefs as ‘degrees of belief’? Although probabilities are intervals, we could still do so. Or we could say that any ‘degree of belief’ satisfying the probability bounds was ‘rational’. But what would be the point of doing so? We agree with Ramsey that logic cannot determine a real-valued a priori degree of belief in pulling a black ball from an urn. This seems a case where degrees of belief are not appropriate. No particular degree of belief is defensible. We deny that there are any appropriate a priori degrees of belief, though there is a fine a priori probability: [0, 1]. There are real valued bounds on degrees of belief, determined by the logical structure of our evidence. [Kyburg, 2003, p. 147]

Kyburg is making the following points here. Evidence rarely determines a unique value for an agent’s degree of belief—rather, it narrows down rational belief to an interval. One can view this interval as providing bounds on rational degree of belief, but since evidence cannot be used to justify the choice of one point over another in this interval, there seems to be little reason to talk of the individual points, and one can instead simply treat the interval itself as a partial belief. This view fits very well with the interpretation of evidential probability as some kind of measure of weight of evidence. (And evidential probability provides a natural measure of weight of evidence: the narrower the interval, the weightier the evidence.) Hence if evidence only narrows down probability to an interval, then there does indeed seem to be little need to talk of anything but the interval when measuring features of the evidence.

But the view does not address how to fix a sharp degree of belief—intentionally so, since Kyburg’s program was designed in part to show us how far one can go with relative frequency information alone. Even so, we may ask whether there is a way to use the resources of evidential probability to fix sharp degrees of belief. In other words, we might return to Carnap’s original distinction between probability1 and probability2 and ask how a theory of the latter can be used to constrain the former.
Step 1   Evidence              {P(ϕ) ∈ [lϕ, uϕ]}
Step 2   Acceptance            Γδ = {ϕ : lϕ ≥ 1 − δ}
Step 3   Uncertain reasoning   {P(χ) ∈ [lχ, uχ]}
Step 4   Acceptance            Γε = {χ : lχ ≥ 1 − ε}

Figure 1. The structure of (first-order) EP inferences.

If we want to talk
not only of the quality of our evidence but also of our disposition to act on that evidence, then it would appear that we need a richer language than that provided by EP alone: while evidence—and hence EP—cannot provide grounds to prefer one point in the interval over another as one’s degree of belief, there may be other, non-evidential grounds for some such preference, and formalising this move would require going beyond EP. Reconciling EP with a Bayesian approach has been considered to be highly problematic [Levi, 1977; Levi, 1980; Seidenfeld, 2007], and was vigorously resisted by Kyburg throughout his life. On the other hand, Kyburg’s own search for an EP-compatible decision theory was rather limited [Kyburg, 1990]. It is natural then to explore how to modify evidential probability in order that it might handle point-valued degrees of belief and thereby fit with Bayesian decision theory. Accordingly second-order EP departs from Kyburg’s EP by viewing evidential probability intervals as bounds on rational degree of belief, P (χ) ∈ Prob(χ, Γδ ). In §5 we will go further still by viewing the results of EP as feeding into objective Bayesian epistemology.
3.1.0.2 Acceptance and Rejection. If we allow ourselves the language of point-valued degrees of belief, (first-order) EP can be seen to work like this. An agent has evidence which consists of some propositions ϕ1 , . . . , ϕn and information about their risk levels. He then accepts those propositions whose risk levels are below the agent’s threshold δ. This leaves him with the evidential certainties, Γδ = {ϕi : P (ϕi ) ≥ 1 − δ}. From Γδ the agent infers propositions ψ of the form P (χ) ∈ [l, u]. In turn, from these propositions the agent infers the practical certainties Γε = {χ : l ≥ 1 − ε}. This sequence of steps is depicted in Figure 1. There are two modes of reasoning that are intermeshed here: on the one hand the agent is using evidence to reason under uncertainty about the conclusion proposition ψ, and on the other he is deciding which propositions to accept and reject. The acceptance mode appears in two places: deciding which evidential propositions to accept and deciding whether to accept the proposition χ to which the conclusion ψ refers. With second-order EP, on the other hand, acceptance is delayed until all reasoning under uncertainty is completed. Then we treat acceptance as a decision problem requiring a decision-theoretic solution—e.g., accept those propositions
whose acceptance maximises expected utility.¹⁰ Coupling this solution with the use of point-valued probabilities we have second-order evidential probability (2oEP), whose inferential steps are represented in Figure 2.

Step 1   Evidence              Φ = {P(ϕ) ∈ [lϕ, uϕ]}
Step 2   Uncertain reasoning   Ψ = {P(χ) ∈ [lχ, uχ]}
Step 3   Acceptance            {χ : decision-theoretically optimal}

Figure 2. The structure of 2oEP inferences.

10. Note that maximising expected utility is not straightforward in this case, since bounds on probabilities, rather than the probabilities themselves, are input into the decision problem. EP-calibrated objective Bayesianism (§5) goes a step further by determining point-valued probabilities from these bounds, thereby making maximisation of expected utility straightforward. See [Williamson, 2009] for more on combining objective Bayesianism with a decision-theoretic account of acceptance.

There are two considerations that motivate this more strict separation of uncertain reasoning and acceptance. First, such a separation allows one to chain inferences—something which is not possible in 1oEP. By ‘chaining inferences’ we mean that the results of step 2 of 2oEP can be treated as an input for a new round of uncertain reasoning, to be recombined with evidence and to yield further inferences. Only once the chain of uncertain reasoning is complete will the acceptance phase kick in. Chaining of inferences is explained in further detail in §3.2. Second, such a separation allows one to keep track of the uncertainties that attach to the evidence. To each item of evidence ϕ attaches an interval [lϕ, uϕ] representing the risk or reliability of that evidence. In 1oEP, step 2 ensures that one works just with those propositions ϕ whose risk levels meet the threshold of acceptance. But in 2oEP there is no acceptance phase before uncertain reasoning is initiated, so one works with the entirety of the evidence, including the risk intervals themselves. While the use of this extra information makes inference rather more complicated, it also makes inference more accurate, since the extra information can matter—the results of 2oEP can differ from the results of 1oEP.

We adopt a decision-theoretic account of acceptance for the following reason. In 1oEP, each act of acceptance uniformly accepts those propositions whose associated risk is less than some fixed threshold: δ in step 2 and ε in step 4. (This allows statements to detach from their risk levels and play a role as logical constraints in inference.) But in practice thresholds of acceptance depend not so much on the step in the chain of reasoning as on the proposition concerned and, indeed, the whole inferential set-up. To take a favourite example of Kyburg’s, consider a lottery. The threshold of acceptance of the proposition the lottery ticket that the seller is offering me will lose may be higher than that of the coin with a bias in favour of heads that I am about to toss will land heads and lower than that of the moon is made of blue cheese. This is because nothing may hang on the
coin toss (in which case a 60% bias in favour of heads may be quite adequate for acceptance), while rather a lot hangs on accepting that the moon is made of blue cheese—many other propositions that I have hitherto granted will have to be revisited if I were to accept this proposition. Moreover, if I am going to use the judgement to decide whether to buy a ticket then the threshold of acceptance of the lottery proposition should plausibly depend on the extent to which I value the prize. Given these considerations, acceptance of a proposition can fruitfully be viewed as a decision problem, depending on the decision set-up including associated utilities [Williamson, 2009]. Again, while this is more complicated than the 1oEP solution of modelling acceptance using a fixed threshold, the subtleties of a full-blown decision-theoretic account can matter to the resulting inferences.
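To see how stakes move the threshold, the toy calculation below (our illustration; the chapter fixes no particular utilities) treats acceptance as a comparison of expected utilities. The hypothetical payoff numbers are chosen only to echo the coin-toss and blue-cheese cases.

```python
# A toy sketch (ours; the chapter commits to no specific utilities) of
# acceptance as expected-utility maximisation: the probability at which
# accepting a proposition becomes optimal depends on the stakes.

def accept(p, u_accept_true, u_accept_false, u_reject_true, u_reject_false):
    """Accept iff expected utility of accepting is at least that of rejecting."""
    eu_accept = p * u_accept_true + (1 - p) * u_accept_false
    eu_reject = p * u_reject_true + (1 - p) * u_reject_false
    return eu_accept >= eu_reject

# Little hangs on a coin toss: wrongly accepting costs little.
print(accept(0.6, 1, -1, 0, 0))      # True: a 60% bias suffices

# Much hangs on "the moon is made of blue cheese": wrongly accepting is costly.
print(accept(0.6, 1, -100, 0, 0))    # False: 0.6 is far below the threshold
print(accept(0.995, 1, -100, 0, 0))  # True only at very high probability
```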
3.2 Calculating Second-order EP
In this section we will be concerned with developing some machinery to perform uncertain reasoning in second-order evidential probability (step 2 in Figure 2). See [Haenni et al., 2010] for further details of this approach.
3.2.1 Entailment
Let L♯ be a propositional language whose propositional variables are of the form ϕ[a,b] for atomic propositions ϕ ∈ L.¹¹ Here L is the language of (first-order) EP extended to include statements of the form P(χ) ∈ [l, u], and, for a proposition ϕ of L, ϕ[a,b] is short for P(ϕ) ∈ [a, b]. Hence in L♯ we can express propositions about higher-order probabilities, e.g., P(χ) ∈ [l, u][a,b], which is short for P(P(χ) ∈ [l, u]) ∈ [a, b]. We write ϕa as an abbreviation of ϕ[a,a].

11. As it stands L♯ contains uncountably many propositional variables, but restrictions can be placed on a and b to circumscribe the language if need be.

For µ, ν ∈ L♯ write µ |≈2o ν if ν deductively follows from µ by appealing to the axioms of probability and the following EP-motivated axioms:

A1: Given ϕ1¹, . . . , ϕn¹, if Prob(χ, {ϕ1, . . . , ϕn}) = [l, u] is derivable by (first-order) EP, then infer ψ¹, where ψ ∈ L is the statement P(χ) ∈ [l, u].

A2: Given ψ¹, infer χ[l,u], where ψ ∈ L is the statement P(χ) ∈ [l, u].

Axiom A1 ensures that EP inferences carry over to 2oEP, while axiom A2 ensures that probabilities at the first-order level can constrain those at the second-order level. The entailment relation |≈2o will be taken to constitute core second-order EP. The idea is that, given input evidence Φ consisting of a set of sentences of L♯, one infers a set Ψ of further such sentences using the above consequence relation. Note that although |≈2o is essentially classical consequence with extra axioms, it is a nonmonotonic consequence relation, since 1oEP is nonmonotonic. But 2oEP yields a strong logic inasmuch as it combines the axioms of probability with the rules of
EP, and so questions of consistency arise. Will there always be some probability function that satisfies the constraints imposed by 1oEP consequences of evidence? Not always: see [Seidenfeld, 2007] for some counterexamples. Consequently, some consistency-maintenance procedure needs to be invoked to cope with such cases. (Of course, some consistency maintenance procedure will in any case be required to handle certain inconsistent sets of evidential propositions, so there may be no extra burden here.) One option is to consider probability functions satisfying (EP consequences of) maximal satisfiable subsets of evidential statements, for example. In this paper we will not commit to a particular consistency-maintenance procedure; we leave this interesting question as a topic for further research.
3.2.2 Credal Networks
This entailment relation can be implemented using probabilistic networks, as we shall now explain. For efficiency reasons, we make the following further assumptions. First we assume that P is distributed uniformly over the EP interval unless there is evidence otherwise:

A3: If Φ |≈2o χ[l,u] then P(χ[l′,u′] | Φ) = |[l, u] ∩ [l′, u′]| / |[l, u]|, as long as this is consistent with other consequences of Φ.

Second, we assume that items of evidence are independent unless there is evidence of dependence:

A4: If ϕ1[a1,b1], . . . , ϕk[ak,bk] ∈ Φ then P(ϕ1[a1,b1], . . . , ϕk[ak,bk]) = P(ϕ1[a1,b1]) · · · P(ϕk[ak,bk]), as long as this is consistent with other consequences of Φ.
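As a quick illustration of A3's uniformity default (our sketch, not the authors' code), the conditional probability it assigns is just the proportion of the inferred interval that overlaps the queried one:

```python
# Sketch of the A3 default: P(chi^[l',u'] | Phi) as an overlap proportion.
def a3(l, u, lp, up):
    """Length of [l, u] intersected with [l', u'], divided by length of [l, u]."""
    overlap = max(0.0, min(u, up) - max(l, lp))
    return overlap / (u - l)

print(a3(.2, .4, 0, .25))  # 0.25: a quarter of [.2, .4] lies inside [0, .25]
```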
items of evidence are known to be dependent then the corresponding nodes will be connected by arrows in the credal network representation outlined below. Any information that helps to quantify the dependence will help determine the conditional probability distributions associated with these arrows. If P is known to be distributed non-uniformly over the EP intervals then information about its distribution will need to be used to determine conditional probability distributions in the credal net.
320
Gregory Wheeler and Jon Williamson
Credal networks are of fundamental importance for inference in probabilistic logic [Haenni et al., 2010]. A logic is a probabilistic logic if its semantic interpretations are probability functions; the entailment relation of first-order EP does not constitute a probabilistic logic in this sense, but the entailment relation |≈2o of second-order EP does. In a probabilistic logic we are typically faced with the following sort of question: given premiss propositions ϕ1 , . . . , ϕn and their respective probabilities X1 , . . . , Xn , what probability should we attach to a conclusion proposition ψ? This question can be written in the form Xn ? 1 ϕX 1 , . . . , ϕn |≈ ψ
where |≈ is the entailment relation of the probabilistic logic. For example, in second-order evidential probability we might be faced with the following question %x(F x, Rx, [.2, .4])[.9,1] , Rt |≈2o P (F t) ∈ [.2, .4]? This asks, given evidence that (i) the proposition that the frequency of attribute F in reference class R is between .2 and .4 has probability at least .9, and (ii) t falls in reference class R, what probability interval should attach to the proposition that the probability that t has attribute F is between .2 and .4? In first-order EP, if 1 − δ ≥ .9 then Prob(F t, Γδ ) = [.2, .4] would be conclusively inferred (and hence treated as if it had probability 1). Clearly this disregards the uncertainty that attaches to the statistical evidence; the question is, what uncertainty should attach to the conclusion as a consequence? (This is a second-order uncertainty; hence the name second-order evidential probability.) One can construct a credal network to answer this question as follows. Let ϕ1 be the proposition %x(F x, Rx, [.2, .4]), ϕ2 be Rt and ψ be P (F t) ∈ [.2, .4]. These can all be thought of as variables that take possible values True and False. The structure of 1oEP calculations determines the structure of the directed acyclic graph in the credal net: ϕ1 H HH
H j H ψ *
ϕ2 The conditional probability constraints involving the premiss propositions are simply their given risk levels: P (ϕ1 ) ∈ [.9, 1], P (ϕ2 ) = 1. Turning to the conditional probability constraints involving the conclusion proposition, these are determined by 1oEP inferences via axioms A1-3: P (ψ|ϕ1 ∧ ϕ2 ) = 1,
P(ψ | ¬ϕ1 ∧ ϕ2) = P(ψ | ϕ1 ∧ ¬ϕ2) = P(ψ | ¬ϕ1 ∧ ¬ϕ2) = .2. Finally, the Markov condition holds in virtue of A4, which implies that ϕ1 ⊥⊥ ϕ2. Inference algorithms for credal networks can then be used to infer the uncertainty that should attach to the conclusion, P(ψ) ∈ [.92, 1]. Hence we have:

%x(Fx, Rx, [.2, .4])[.9,1], Rt |≈2o P(Ft) ∈ [.2, .4][.92,1]

3.2.2.1 Chaining Inferences. While it is not possible to chain inferences in 1oEP, this is possible in 2oEP, and the credal network representation can just as readily be applied to this more complex case. Consider the following question:

%x(Fx, Rx, [.2, .4])[.9,1], Rt, %x(Gx, Fx, [.2, .4])[.6,.7] |≈2o P(Gt) ∈ [0, .25]?

As we have just seen, the first two premisses can be used to infer something about Ft, namely P(Ft) ∈ [.2, .4][.92,1]. But now this inference can be used in conjunction with the third premiss to infer something about Gt. To work out the probability bounds that should attach to an inference to P(Gt) ∈ [0, .25], we can apply the credal network procedure. Again, the structure of the graph in the network is given by the structure of EP inferences:

ϕ1 → ψ ← ϕ2,  ψ → ψ′ ← ϕ3

Here ϕ3 is %x(Gx, Fx, [.2, .4]) and ψ′ is P(Gt) ∈ [0, .25]; other variables are as before. The conditional probability bounds of the previous example simply carry over: P(ϕ1) ∈ [.9, 1], P(ϕ2) = 1, P(ψ | ϕ1 ∧ ϕ2) = 1, P(ψ | ¬ϕ1 ∧ ϕ2) = .2 = P(ψ | ϕ1 ∧ ¬ϕ2) = P(ψ | ¬ϕ1 ∧ ¬ϕ2). But we need to provide further bounds. As before, the risk level associated with the third premiss ϕ3 provides one of these: P(ϕ3) ∈ [.6, .7]. The constraints involving the new conclusion ψ′ are generated by A3:
P(ψ′ | ψ ∧ ϕ3) = |[.2 × .6 + .8 × .1, .4 × .7 + .6 × .1] ∩ [0, .25]| / |[.2 × .6 + .8 × .1, .4 × .7 + .6 × .1]| = .31,
P(ψ′ | ¬ψ ∧ ϕ3) = .27, P(ψ′ | ψ ∧ ¬ϕ3) = P(ψ′ | ¬ψ ∧ ¬ϕ3) = .25.

The Markov Condition holds in virtue of A4 and the structure of EP inferences. Performing inference in the credal network yields P(ψ′) ∈ [.28, .29]. Hence:

%x(Fx, Rx, [.2, .4])[.9,1], Rt, %x(Gx, Fx, [.2, .4])[.6,.7] |≈2o P(Gt) ∈ [0, .25][.28,.29]

This example shows how general inference in 2oEP can be: we are not asking which probability bounds attach to a 1oEP inference in this example, but rather which probability bounds attach to an inference that cannot be drawn by 1oEP. The example also shows that the probability interval attaching to the conclusion can be narrower than the intervals attaching to the premisses.
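Because P(ψ) in the first example is multilinear in the one unconstrained premiss probability, its bounds can be recovered by brute force over the endpoints of the credal set, as the following sketch (ours, not the chapter's implementation) shows; full credal networks use dedicated inference algorithms instead.

```python
# A brute-force sketch (ours, not the chapter's) of the first credal-network
# computation: with P(phi1) in [.9, 1], P(phi2) = 1, and the conditional
# constraints given above, P(psi) is linear in P(phi1), so evaluating the
# endpoints of the interval recovers the bounds [.92, 1].

def p_psi(p_phi1, p_phi2=1.0):
    # Markov condition (A4): phi1 and phi2 independent; sum over their values.
    total = 0.0
    for v1, q1 in ((True, p_phi1), (False, 1 - p_phi1)):
        for v2, q2 in ((True, p_phi2), (False, 1 - p_phi2)):
            cond = 1.0 if (v1 and v2) else 0.2  # P(psi | phi1, phi2)
            total += q1 * q2 * cond
    return total

bounds = [p_psi(p) for p in (0.9, 1.0)]
print(min(bounds), max(bounds))  # 0.92 1.0, i.e. P(psi) in [.92, 1]
```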
4 OBJECTIVE BAYESIAN EPISTEMOLOGY

4.1 Motivation
We saw above that evidential probability concerns the impact of evidence upon a conclusion. It does not on its own say how strongly one should believe the conclusion. Kyburg was explicit about this, arguing that evidential probabilities can at most be thought of as ‘real-valued bounds on degrees of belief, determined by the logical structure of our evidence’ [Kyburg, 2003, p. 147]. To determine rational degrees of belief themselves, one needs to go beyond EP, to a normative theory of partial belief. Objective Bayesian epistemology is just such a normative theory [Rosenkrantz, 1977; Jaynes, 2003; Williamson, 2005]. According to the version of objective Bayesianism presented in [Williamson, 2005], one’s beliefs should adhere to three norms:

Probability: The strengths of one’s beliefs should be representable by probabilities. Thus they should be measurable on a scale between 0 and 1, and should be additive.

Calibration: These degrees of belief should fit one’s evidence. For example, degrees of belief should be calibrated with frequency: if all one knows about the truth of a proposition is an appropriate frequency, one should believe the proposition to the extent of that frequency.

Equivocation: One should not believe a proposition more strongly than the evidence demands. One should equivocate between the basic possibilities as far as the evidence permits.

These norms are imprecisely stated: some formalism is needed to flesh them out.
4.1.0.2 Probability. In the case of the Probability norm, the mathematical calculus of probability provides the required formalism. Of course mathematical probabilities attach to abstract events while degrees of belief attach to propositions, so the mathematical calculus needs to be tailored to apply to propositions. It is usual to proceed as follows — see, e.g., [Paris, 1994]. Given a predicate language L with constants ti that pick out all the members of the domain, and sentences θ, ϕ of L, a function P is a probability function if it satisfies the following axioms:

P1: If |= θ then P(θ) = 1;

P2: If |= ¬(θ ∧ ϕ) then P(θ ∨ ϕ) = P(θ) + P(ϕ);

P3: P(∃x θ(x)) = limn→∞ P(θ(t1) ∨ · · · ∨ θ(tn)).
P1 sets the scale, P2 ensures that probability is additive, and P3, called Gaifman’s condition, sets the probability of ‘θ holds of something’ to be the limit of the probability of ‘θ holds of one or more of t1, ..., tn’, as n tends to infinity. The Probability norm then requires that the strengths of one’s beliefs be representable by a probability function P over (a suitable formalisation of) one’s language. Writing ℙ for the set of probability functions over L, the Probability norm requires that one’s beliefs be representable by some P ∈ ℙ.

4.1.0.3 Calibration. The Calibration norm says that the strengths of one’s beliefs should be appropriately constrained by one’s evidence E. (By evidence we just mean everything taken for granted in the current operating context—observations, theory, background knowledge, etc.) This norm can be explicated by supposing that there is some set E ⊆ ℙ of probability functions that satisfy constraints imposed by evidence, and that one’s degrees of belief should be representable by some PE ∈ E. Now typically one has two kinds of evidence: quantitative evidence that tells one something about physical probability (frequency, chance, etc.), and qualitative evidence that tells one something about how one’s beliefs should be structured. In [Williamson, 2005] it is argued that these kinds of evidence should be taken into account in the following way. First, quantitative evidence (e.g., evidence of frequencies) tells us that the physical probability function P∗ must lie in some set ℙ∗ of probability functions. One’s degrees of belief ought to be similarly constrained by evidence of physical probabilities, subject to a few provisos:

C1: E ≠ ∅. If evidence is inconsistent this tells us something about our evidence rather than about physical probability, so one cannot conclude that ℙ∗ = ∅ and one can hardly insist that PE ∈ ∅. Instead ℙ∗ must be determined by some consistency maintenance procedure—one might, for example, take ℙ∗ to be determined by maximal consistent subsets of one’s evidence—and neither ℙ∗ nor E can ever be empty.
C2: If E is consistent and implies proposition θ that does not mention physical probability P∗, then P(θ) = 1 for all P ∈ E. This condition merely asserts that categorical evidence be respected — it prevents E from being too inclusive.
The qualification that θ must not mention physical probability is required because in some cases evidence of physical probability should be treated more pliably:
C3: If P, Q ∈ P∗ and R = λP + (1 − λ)Q for λ ∈ [0, 1] then, other things being equal, one should be permitted to take R as one’s belief function PE.
Note in particular that C3 implies that, other things being equal, if P ∈ P∗ then P ∈ E; it also implies C1 (under the understanding that P∗ ≠ ∅). C3 is required to handle the following kind of scenario. Suppose for example that you have evidence just that an experiment with two possible outcomes, a and ¬a, has taken place. As far as you are aware, the physical probability of a is now 1 or 0 and no value in between. But this does not imply that your degree of belief in a should be 1 or 0 and no value in between — a value of 1/2, for instance, is quite reasonable in this case. C3 says that, in the absence of other overriding evidence, ⟨P∗⟩ ⊆ E where ⟨P∗⟩ is the convex hull of P∗. The following condition imposes the converse relation:
C4: E ⊆ ⟨P∗⟩.
Suppose for example that evidence implies that either P∗(a) = 0.91 or P∗(a) = 0.92. While C3 permits any element of the interval [0.91, 0.92] as a value for one’s degree of belief PE(a), C4 confines PE(a) to this interval — indeed a value outside this interval is unwarranted by this particular evidence. Note that C4 implies C2: θ being true implies that its physical probability is 1, so P(θ) = 1 for all P ∈ P∗, hence for all P ∈ ⟨P∗⟩, hence for all P ∈ E. In the absence of overriding evidence the conditions C1–4 set E = ⟨P∗⟩. This sheds light on how quantitative evidence constrains degrees of belief, but one may also have qualitative evidence which constrains degrees of belief in ways that are not mediated by physical probability. For example, one may know about causal influence relationships involving variables in one’s language: this may tell one something about physical probability, but it also tells one other things — e.g., that if one extends one’s language to include a new variable that is not a cause of the current variables, then that does not on its own provide any reason to change one’s beliefs about the current variables. These constraints imposed by evidence of influence relationships, discussed in detail in [Williamson, 2005], motivate a further principle:
C5: E ⊆ S where S is the set of probability functions satisfying structural constraints.
We will not dwell on C5 here since structural constraints are peripheral to the theme of this paper, namely to connections between objective Bayesian epistemology and evidential probability. It turns out that the set S is always non-empty, hence C1–5 yield:
Calibration: One’s degrees of belief should be representable by PE ∈ E = ⟨P∗⟩ ∩ S.
4.1.0.4 Equivocation. The third norm, Equivocation, can be fleshed out by requiring that PE be a probability function, from all those that are calibrated with evidence, that is as close as possible to a totally equivocal probability function P= called the equivocator on L. But we need to specify the equivocator and also what we mean by ‘as close as possible’. To specify the equivocator, first create an ordering a1, a2, . . . of the atomic sentences of L — sentences of the form Ut where U is a predicate or relation and t is a tuple of constants of corresponding arity — such that those atomic sentences involving constants t1, . . . , tn−1 occur earlier in the ordering than those involving tn. Then we can define the equivocator P= by P=(aj^{ej} | a1^{e1} ∧ · · · ∧ aj−1^{ej−1}) = 1/2 for all j and all e1, . . . , ej ∈ {0, 1}, where ai^1 is just ai and ai^0 is ¬ai. Clearly P= equivocates between each atomic sentence of L and its negation. In order to explicate ‘as close as possible’ to P= we shall appeal to the standard notion of distance between probability functions, the n-divergence of P from Q:
dn(P, Q) =df Σ_{e1,...,ern = 0}^{1} P(a1^{e1} ∧ · · · ∧ arn^{ern}) log [ P(a1^{e1} ∧ · · · ∧ arn^{ern}) / Q(a1^{e1} ∧ · · · ∧ arn^{ern}) ]
Here a1, ..., arn are the atomic sentences involving constants t1, ..., tn; we follow the usual convention of taking 0 log 0 to be 0, and note that the n-divergence is not a distance function in the usual mathematical sense because it is not symmetric and does not satisfy the triangle inequality — rather, it is a measure of the amount of information that is encapsulated in P but not in Q. We then say that P is closer to the equivocator than Q if there is some N such that for n ≥ N, dn(P, P=) < dn(Q, P=). Now we can state the Equivocation norm as follows. For a set Q of probability functions, denote by ↓Q the members of Q that are closest to the equivocator P=.13 Then,
E1: PE ∈ ↓E.
This principle is discussed at more length in [Williamson, 2008]. It can be construed as a version of the maximum entropy principle championed by Edwin Jaynes. Note that while some versions of objective Bayesianism assume that an agent’s degrees of belief are uniquely determined by her evidence and language, we make no such assumption here: ↓E may not be a singleton.
13 If there are no closest members (i.e., if chains are all infinitely descending: for any member P of Q there is some P′ in Q that is closer to the equivocator than P) the context may yet determine an appropriate subset ↓Q ⊆ Q of probability functions that are sufficiently close to the equivocator; for simplicity of exposition we shall ignore this case in what follows.
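To make the interplay of Calibration and Equivocation concrete, here is a hedged sketch for the simplest possible case, a single atomic sentence a: evidence constrains P(a) to a closed interval, and minimising the divergence from the equivocator selects the point of that interval nearest 1/2. The function names are ours, and the reduction to a one-dimensional search is a simplification that only works in this one-atom case.

```python
# One atomic sentence a; the equivocator assigns P=(a) = 1/2. Divergence from
# the equivocator is convex in p with its minimum at 1/2, so the minimiser over
# a closed interval [l, u] is found among l, u and (if available) 1/2 itself.
from math import log

def divergence(p, q=0.5):
    """d(P, P=) for a single atom (with 0 log 0 taken to be 0)."""
    terms = [(p, q), (1 - p, 1 - q)]
    return sum(x * log(x / y) for x, y in terms if x > 0)

def equivocate(l, u):
    """Return the calibrated value for P(a) closest to the equivocator."""
    candidates = [l, u, 0.5] if l <= 0.5 <= u else [l, u]
    return min(candidates, key=divergence)

print(equivocate(0.2, 0.5))    # 0.5  -- 1/2 lies inside the interval
print(equivocate(0.91, 0.92))  # 0.91 -- the endpoint nearest to 1/2
```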
Figure 3. Constraint graph: A1 — A2 — A3.
Figure 4. Graph satisfying the Markov Condition: A1 → A2 → A3.
4.2 Calculating Objective Bayesian Degrees of Belief
Just as credal nets can be used for inference in 2oEP, so too can they be used for inference in OBE. The basic idea is to use a credal net to represent ↓E, the set of rational belief functions, and then to perform inference to calculate the range of probability values these functions ascribe to some proposition of interest. These methods are explained in detail in [Williamson, 2008]; here we shall just give the gist. For simplicity we shall describe the approach in the base case in which the evidence consists of interval bounds on the probabilities of sentences of the agent’s language L: E = {P∗(ϕi) ∈ [li, ui] : i = 1, . . . , k}, E is consistent and does not admit infinite descending chains; but these assumptions can all be relaxed. In this case E = ⟨P∗⟩ ∩ S = P∗. Moreover, the evidence can be written in the language L♯ introduced earlier: E = {ϕ1^{[l1,u1]}, . . . , ϕk^{[lk,uk]}}, and the question facing objective Bayesian epistemology takes the form

ϕ1^{[l1,u1]}, . . . , ϕk^{[lk,uk]} |≈OBE ψ^?
where |≈OBE is the entailment relation defined by objective Bayesian epistemology as outlined above. As explained in [Williamson, 2008], this entailment relation is nonmonotonic but it is well-behaved in the sense that it satisfies all the System-P properties of nonmonotonic logic. The method is essentially this. First construct an undirected graph, the constraint graph, by linking with an edge those atomic sentences that appear in the same item of evidence. One can read off this graph a list of probabilistic independencies that any function in ↓E must satisfy: if node A separates nodes B and C in this graph then B ⊥⊥ C | A for each probability function in ↓E. This constraint graph can then be transformed into a directed acyclic graph for which the Markov Condition captures many or all of these independencies. Finally one can calculate bounds on the probability of each node conditional on its parents in the graph by using entropy maximisation methods: each probability function in ↓E maximises entropy subject to the constraints imposed by E, and one can identify the probability it gives to one variable conditional on its parents using numerical optimisation methods [Williamson, 2008].
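The first step of this method can be illustrated with a small sketch (hypothetical names; evidence items are simplified to the sets of atomic sentences they mention) that builds the constraint graph for the worked example below:

```python
# Build the undirected constraint graph by linking atomic sentences that occur
# in the same item of evidence. The three evidence items correspond to the
# three premisses of the example that follows.
from itertools import combinations

evidence_items = [{"A1", "A2"}, {"A2", "A3"}, {"A1"}]

edges = set()
for item in evidence_items:
    edges.update(frozenset(pair) for pair in combinations(sorted(item), 2))

print(sorted(tuple(sorted(e)) for e in edges))
# [('A1', 'A2'), ('A2', 'A3')] -- A2 separates A1 from A3, so every function
# in the set of rational belief functions must satisfy A1 independent of A3
# given A2.
```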
To take a simple example, suppose we have the following question:

∀x(Ux → Vx)^{3/5}, ∀x(Vx → Wx)^{3/4}, Ut1^{[0.8,1]} |≈OBE Wt1^?
A credal net can be constructed to answer this question. There is only one constant symbol t1, and so the atomic sentences of interest are Ut1, Vt1, Wt1. Let A1 be Ut1, A2 be Vt1 and A3 be Wt1. Then the constraint graph G is depicted in Fig. 3 and the corresponding directed acyclic graph H is depicted in Fig. 4. It is not hard to see that P(A1) = 4/5, P(A2|A1) = 3/4, P(A2|¬A1) = 1/2, P(A3|A2) = 5/6, P(A3|¬A2) = 1/2; together with H, these probabilities yield a credal network. (In fact, since the conditional probabilities are precisely determined rather than bounded, we have a special case of a credal net called a Bayesian net.) The Markov Condition holds since separation in the constraint graph implies probabilistic independence. Standard inference methods then give us P(A3) = 11/15 as an answer to our question.
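The arithmetic of this example can be checked directly by propagating the stated conditional probabilities through the chain A1 → A2 → A3; this sketch simply applies the law of total probability to the numbers read off the text:

```python
# Exact check of the example's answer, using the conditional probabilities
# stated above (they are taken as given, not derived here).
from fractions import Fraction as F

pA1 = F(4, 5)
pA2 = F(3, 4) * pA1 + F(1, 2) * (1 - pA1)  # P(A2) by total probability
pA3 = F(5, 6) * pA2 + F(1, 2) * (1 - pA2)  # P(A3) likewise
print(pA2, pA3)  # 7/10 11/15 -- agreeing with the answer in the text
```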
5 EP-CALIBRATED OBJECTIVE BAYESIANISM

5.1 Motivation
At face value, evidential probability and objective Bayesian epistemology are very different theories. The former concerns the impact of evidence of physical probability, Carnap’s probability₂, and concerns acceptance and rejection; it appeals to interval-valued probabilities. The latter theory concerns rational degree of belief, probability₁, and invokes the usual point-valued mathematical notion of probability. Nevertheless the core of these two theories can be reconciled, by appealing to second-order EP as developed above. 2oEP concerns the impact of evidence on rational degree of belief. Given statistical evidence, 2oEP will infer statements about rational degrees of belief. These statements can be viewed as constraints that should be satisfied by the degrees of belief of a rational agent with just that evidence. So 2oEP can be thought of as mapping statistical evidence E to a set E of rational belief functions that are compatible with that evidence. (This is a non-trivial mapping because frequencies attach to a sequence of outcomes or experimental conditions that admit repeated instantiations, while degrees of belief attach to propositions. Hence the epistemological reference-class problem arises: how can one determine appropriate single-case probabilities from information about generic probabilities? Evidential probability is a theory that tackles this reference-class problem head on: it determines a probability interval that attaches to a sentence from statistical evidence about repetitions.) But this mapping from E to E is just what is required by the Calibration norm of OBE. We saw in §4 that OBE maps evidence E to E = ⟨P∗⟩ ∩ S, a set of probability functions calibrated with that evidence. But no precise details were given as to how ⟨P∗⟩, nor indeed P∗, is to be determined. In special cases this is
straightforward. For example, if one’s evidence is just that the chance of a is 1/2, P∗(a) = 1/2, then ⟨P∗⟩ = P∗ = {P ∈ P : P(a) = 1/2}. But in general, determining ⟨P∗⟩ is not a trivial enterprise. In particular, statistical evidence takes the form of information about generic frequencies rather than single-case chances, and so the reference-class problem arises. It is here that 2oEP can be plugged in: if E consists of propositions of L♯ — i.e., propositions, including statistical propositions, to which probabilities or closed intervals of probabilities attach — then ⟨P∗⟩ is the set of probability functions that satisfy the |≈2o consequences of E.
C6: If E is a consistent set of propositions of L♯ then ⟨P∗⟩ = {P : P(χ) ∈ [l, u] for all χ, l, u such that E |≈2o χ^{[l,u]}}.
We shall call OBE that appeals to calibration principles C1–6 evidential-probability-calibrated objective Bayesian epistemology, or EP-OBE for short. We shall denote the corresponding entailment relation by |≈EP-OBE. We see then that there is a sense in which EP and OBE can be viewed as complementary rather than in opposition. Of course, this isn’t the end of the matter. Questions still arise as to whether EP-OBE is the right way to flesh out OBE. One can, for instance, debate the particular rules that EP uses to handle reference classes (§2.2). One can also ask whether EP tells us everything we need to know about calibration. As mentioned in §4.1, further rules are needed in order to handle structural evidence, fleshing out C5. Moreover, both 1oEP and 2oEP take statistical statements as input; these statements themselves need to be inferred from particular facts — indeed EP, OBE and EP-OBE each presume a certain amount of statistical inference. Consequently we take it as understood that Calibration requires more than just C1–6. And questions arise as to whether the alterations to EP that are necessary to render it compatible with OBE are computationally practical. Second-order EP replaces the original theory of acceptance with a decision-theoretic account which will incur a computational burden. Moreover, some thought must be given as to which consistency maintenance procedure should be employed in practice. Having said this, we conjecture that there will be real inference problems for which the benefits will be worth the necessary extra work.
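As a rough illustration of the direct-inference step that C6 appeals to, the following sketch maps a single statistical statement %x(Fx, Rx, [l, u]) plus a membership fact Rt to the interval [l, u] for the single-case sentence Ft. The function and its inputs are our own simplification; EP's rules for adjudicating between competing reference classes are omitted entirely.

```python
# A deliberately simplified direct-inference step: one applicable statistic
# transfers its frequency interval to the single-case sentence.
def direct_inference(stats, memberships, target):
    """stats: {(property, ref_class): (l, u)}; memberships: set of (term, class);
    target: (term, property). Returns the interval attaching to the target."""
    intervals = [bounds for (prop, ref), bounds in stats.items()
                 if prop == target[1] and (target[0], ref) in memberships]
    if not intervals:
        return (0.0, 1.0)  # no applicable evidence: only the trivial bounds
    # With a single applicable statistic, its interval is the answer.
    return intervals[0]

stats = {("F", "R"): (0.2, 0.5)}
print(direct_inference(stats, {("t", "R")}, ("t", "F")))  # (0.2, 0.5)
```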
5.2 Calculating EP-Calibrated Objective Bayesian Probabilities
Calculating EP-OBE probabilities can be achieved by combining methods for calculating 2oEP probabilities with methods for calculating OBE probabilities. Since credal nets can be applied to both formalisms independently, they can also be applied to their unification. In fact in order to apply the credal net method to OBE, some means is needed of converting statistical statements, which can be construed as constraints involving generic, repeatably-instantiatable variables, to constraints involving the single-case variables which constitute the nodes of the objective Bayesian credal net; only then can the constraint graph of §4.2 be constructed. The 2oEP credal nets of §3.2 allow one to do this, since this kind
of net incorporates both statistical variables and single-case variables as nodes. Thus 2oEP credal nets are employed first to generate single-case constraints, at which stage the OBE credal net can be constructed to perform inference. This fits with the general view of 2oEP as a theory of how evidence constrains rational degrees of belief and OBE as a theory of how further considerations — especially equivocation — further constrain rational degrees of belief. Consider the following very simple example:

%x(Fx, Rx, [.2, .5]), Rt, ∀x(Fx → Gx)^{3/4} |≈EP-OBE Gt^?
Now the first two premisses yield Ft^{[.2,.5]} by EP. This constraint combines with the third premiss to yield an answer to the above question by appealing to OBE. This answer can be calculated by constructing the following credal net (rendered here as a list of arrows):

ϕ → Ft, Rt → Ft, Ft → Gt
Here ϕ is the first premiss. The left-hand side of this net is the 2oEP net, with associated probability constraints P(ϕ) = 1, P(Rt) = 1, P(Ft|ϕ ∧ Rt) ∈ [.2, .5], P(Ft|¬ϕ ∧ Rt) = 0 = P(Ft|ϕ ∧ ¬Rt) = P(Ft|¬ϕ ∧ ¬Rt). The right-hand side of this net is the OBE net with associated probabilities P(Gt|Ft) = 7/10, P(Gt|¬Ft) = 1/2. Standard inference algorithms then yield an answer of 7/12 to our question:

%x(Fx, Rx, [.2, .5]), Rt, ∀x(Fx → Gx)^{3/4} |≈EP-OBE Gt^{7/12}

6 CONCLUSION
While evidential probability and objective Bayesian epistemology might at first sight appear to be chalk and cheese, on closer inspection we have seen that their relationship is more like horse and carriage—together they do a lot of work, covering the interface between statistical inference and normative epistemology.
Along the way we have taken in an interesting array of theories — first-order evidential probability, second-order evidential probability, objective Bayesian epistemology and EP-calibrated OBE — that can be thought of as nonmonotonic logics. 2oEP and OBE are probabilistic logics in the sense that they appeal to the usual mathematical notion of probability. More precisely, their entailment relations are probabilistic: premisses entail the conclusion if every model of the premisses satisfies the conclusion, where models are probability functions. This connection with probability means that credal networks can be applied as inference machinery. Credal nets yield a perspicuous representation and the prospect of more efficient inference [Haenni et al., 2010].

ACKNOWLEDGEMENTS

We are grateful to the Leverhulme Trust for supporting this research, and to Prasanta S. Bandyopadhyay, Teddy Seidenfeld and an anonymous referee for helpful comments.

BIBLIOGRAPHY

[Andréka et al., 1998] H. Andréka, J. van Benthem, and I. Németi. Modal languages and bounded fragments of predicate logic. Journal of Philosophical Logic, 27:217–274, 1998.
[Arló-Costa and Parikh, 2005] H. Arló-Costa and R. Parikh. Conditional probability and defeasible inference. Journal of Philosophical Logic, 34:97–119, 2005.
[Carnap, 1962] R. Carnap. The Logical Foundations of Probability. University of Chicago Press, 2nd edition, 1962.
[Cozman, 2000] F. G. Cozman. Credal networks. Artificial Intelligence, 120:199–233, 2000.
[de Finetti, 1974] B. de Finetti. Theory of Probability: A critical introductory treatment. Wiley, 1974.
[Dubbins, 1975] L. E. Dubbins. Finitely additive conditional probability, conglomerability, and disintegrations. Annals of Probability, 3:89–99, 1975.
[Dubois and Prade, 1980] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. Kluwer, North Holland, 1980.
[Haenni et al., 2010] R. Haenni, J.-W. Romeijn, G. Wheeler, and J. Williamson. Probabilistic Logic and Probabilistic Networks. The Synthese Library, Springer, 2010.
[Harper and Wheeler, 2007] W. Harper and G. Wheeler, editors. Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr. King’s College Publications, London, 2007.
[Hawthorne and Makinson, 2007] J. Hawthorne and D. Makinson. The quantitative/qualitative watershed for rules of uncertain inference. Studia Logica, 86(2):247–297, 2007.
[Jaynes, 2003] E. T. Jaynes. Probability theory: the logic of science. Cambridge University Press, Cambridge, 2003.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. Macmillan, London, 1921.
[Kraus et al., 1990] S. Kraus, D. Lehman, and M. Magidor. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44:167–207, 1990.
[Kyburg and Teng, 2001] H. E. Kyburg, Jr. and C. M. Teng. Uncertain Inference. Cambridge University Press, Cambridge, 2001.
[Kyburg et al., 2007] H. E. Kyburg, Jr., C. M. Teng, and G. Wheeler. Conditionals and consequences. Journal of Applied Logic, 5(4):638–650, 2007.
[Kyburg, 2003] H. E. Kyburg, Jr. Are there degrees of belief? Journal of Applied Logic, 1:139–149, 2003.
[Kyburg, 1961] H. E. Kyburg, Jr. Probability and the Logic of Rational Belief. Wesleyan University Press, Middletown, CT, 1961.
[Kyburg, 1990] H. E. Kyburg, Jr. Science and Reason. Oxford University Press, New York, 1990.
[Kyburg, 2007] H. E. Kyburg, Jr. Bayesian inference with evidential probability. In William Harper and Gregory Wheeler, editors, Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr., pages 281–296. King’s College Publications, London, 2007.
[Levi, 1977] I. Levi. Direct inference. Journal of Philosophy, 74:5–29, 1977.
[Levi, 1980] I. Levi. The Enterprise of Knowledge. MIT Press, Cambridge, MA, 1980.
[Levi, 2007] I. Levi. Probability logic and logical probability. In William Harper and Gregory Wheeler, editors, Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr., pages 255–266. College Publications, 2007.
[Paris, 1994] J. B. Paris. The uncertain reasoner’s companion. Cambridge University Press, Cambridge, 1994.
[Pawlak, 1991] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht, 1991.
[Rosenkrantz, 1977] R. D. Rosenkrantz. Inference, method and decision: towards a Bayesian philosophy of science. Reidel, Dordrecht, 1977.
[Seidenfeld, 2007] T. Seidenfeld. Forbidden fruit: When Epistemic Probability may not take a bite of the Bayesian apple. In William Harper and Gregory Wheeler, editors, Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr. King’s College Publications, London, 2007.
[Teng, 2007] C. M. Teng. Conflict and consistency. In William L. Harper and Gregory Wheeler, editors, Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr. King’s College Publications, London, 2007.
[Walley, 1991] P. Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London, 1991.
[Wheeler, 2004] G. Wheeler. A resource bounded default logic. In James Delgrande and Torsten Schaub, editors, NMR 2004, pages 416–422, 2004.
[Wheeler, 2007] G. Wheeler. Two puzzles concerning measures of uncertainty and the positive Boolean connectives. In José Neves, Manuel Santos, and José Machado, editors, Progress in Artificial Intelligence, 13th Portuguese Conference on Artificial Intelligence, LNAI 4874, pages 170–180, Berlin, 2007. Springer-Verlag.
[Williamson, 2005] J. Williamson. Bayesian nets and causality: philosophical and computational foundations. Oxford University Press, Oxford, 2005.
[Williamson, 2008] J. Williamson. Objective Bayesian probabilistic logic. Journal of Algorithms in Cognition, Informatics and Logic, 63:167–183, 2008.
[Williamson, 2009] J. Williamson. Aggregating judgements by merging evidence. Journal of Logic and Computation, 19(3):461–473, 2009.
CONFIRMATION THEORY
James Hawthorne
INTRODUCTION

Confirmation theory is the study of the logic by which scientific hypotheses may be confirmed or disconfirmed (or supported or refuted) by evidence. A specific theory of confirmation is a proposal for such a logic. Presumably the epistemic evaluation of scientific hypotheses should largely depend on their empirical content — on what they say the evidentially accessible parts of the world are like, and on the extent to which they turn out to be right about that. Thus, all theories of confirmation rely on measures of how well various alternative hypotheses account for the evidence.1

1 I make no distinction here between scientific hypotheses and scientific theories. For our purposes a scientific theory is just a large, complex hypothesis. We will suppose (as a formal logic must) that scientific hypotheses are expressible as sentences of a language — e.g. a mathematically sophisticated dialect of English. This supposition need not be in opposition to a semantic view of theories. Presumably, if scientists can express a theory well enough to agree about what it says about the world (or at least about its testable empirical content), it must be expressible in some bit of language. For instance, if a theory is taken to be a set of models, presumably that set of models may be described in mathematical English, perhaps drawing on set theory. In that case the empirical content of the theory will consist of various hypotheses about what parts of the world are modeled (to what degree of approximation) by some particular model or set of models. Such hypotheses should be subject to empirical evaluation. So a theory of confirmation should apply to them.

Most contemporary confirmation theories employ probability functions to provide such a measure. They measure how well the evidence fits what the hypothesis says about the world in terms of how likely it is that the evidence would occur if the hypothesis were true. Such hypothesis-based probabilities of evidence claims are called likelihoods. Clearly, when the evidence is more likely according to one hypothesis than according to an alternative, that should redound to the credit of the former hypothesis and the discredit of the latter. But various theories of confirmation diverge on precisely how this credit is to be measured. A natural approach is to also employ a probabilistic measure to directly represent the degree to which the hypothesis is confirmed or disconfirmed on the evidence. The idea is to rate the degree to which a hypothesis is confirmed on a scale from 0 to 1, where tautologies are always assigned maximal confirmation (degree 1), and where the degree of confirmation of a disjunction of mutually incompatible hypotheses is the sum of the degrees of confirmation of each taken separately. This way of rating confirmation just recapitulates the standard axioms
of probability theory, but applies them as a measure of evidential support. Any theory of confirmation that employs such a measure is a probabilistic confirmation theory. However, confirmation functions of this sort will be of little value unless it can be shown that under reasonable conditions the accumulation of evidence tends to drive the confirmation values given by these functions towards 0 for false hypotheses and towards 1 for true hypotheses.
How should confirmation values be related to what hypotheses imply about evidence claims via the likelihoods? The most straightforward idea would be to have the confirmation function assign to a hypothesis whatever numerical value is had by the likelihood the hypothesis assigns to the evidence. However, this idea won’t work. For one thing, in cases where the hypothesis logically entails the evidence, the likelihood is 1. But we cannot require the confirmational probability of each hypothesis that entails the evidence to be 1. For, several alternative hypotheses may entail the evidence, and a probabilistic confirmation measure cannot assign probability 1 to two (or more) alternative hypotheses based on the same evidence.2

2 The same argument applies whenever two alternative hypotheses assign the evidence a likelihood greater than 1/2 — and similarly, whenever n alternative hypotheses each assign the evidence likelihoods greater than 1/n.

If both likelihoods and the degrees of confirmation for hypotheses are to be measured probabilistically, it seems natural to represent both by a common probability function. In that case what relationship does the degree to which a hypothesis is confirmed on evidence have to the likelihood of the evidence according to the hypothesis? This is where Bayes’ Theorem comes into play. Bayes’ Theorem follows from the standard axioms for probabilities; and it explicitly shows how the probabilistic degree of confirmation for a hypothesis depends on the likelihoods of evidence claims. Thus, any confirmation measure that satisfies the standard axioms of probability theory and employs the same probability function to represent likelihoods will have to be a Bayesian confirmation measure. Any theory of confirmation that employs such a probabilistic measure of confirmation will thus be a Bayesian confirmation theory.
Various Bayesian approaches to confirmation primarily diverge with regard to how they understand the concept of probabilistic confirmation — i.e. with regard to how they interpret the notion of probability that is supposed to be captured by confirmational probability functions. Is the confirmation function supposed to represent the warranted degree of confidence or degree of belief an agent should have in hypotheses (based on the evidence), as the subjectivists and personalists would have it? Or, is the confirmation function some kind of objective logical relationship by which the evidence is supposed to probabilistically entail hypotheses, as some logical objectivists maintain? Or, might the confirmation function represent some other coherent conception of evidential support? These days the most prominent strain of Bayesian confirmation theory takes probabilistic confirmation functions to represent the rationally ideal agent’s subjective degrees of confidence or degrees of belief in the truth of statements or propositions. This view has become so influential that the term ‘Bayesian
confirmation theory’ is often taken to be synonymous with it. But to identify Bayesian confirmation theory with the subjectivist view is a mistake. It either mischaracterizes or entirely disregards a host of non-subjectivist Bayesian accounts. Properly speaking, it is not the subjective interpretation of the confirmation function that makes a confirmation theory Bayesian. Rather, any confirmation theory that gives Bayes’ Theorem a central role in the representation of how a hypothesis is confirmed or refuted by evidence is a Bayesian confirmation theory. And any account that employs the same probability function to represent the hypothesis-based likelihoods of evidence claims and to provide a probabilistic measure of the support strength for those hypotheses by that evidence must give Bayes’ Theorem this central role.
Historically a number of proposals for objective Bayesian confirmation theories have been developed — e.g., by Keynes [1921], Carnap [1950; 1952; 1971; 1980], Jeffreys [1939], Jaynes [1968]. More recently, Rosenkrantz [1981], Maher [1996; 2006], Fitelson [2005], and Williamson [2007] defend objectivist Bayesian accounts of confirmation. The founding proponents of subjectivist Bayesianism include Ramsey [1926], de Finetti [1937], and Savage [1954]. Prominent contemporary subjectivist treatments include those of Jeffrey [1965; 1992], Levi [1967; 1980], Lewis [1980], Skyrms [1984], Earman [1992], Howson and Urbach [1993], Howson [1997] and Joyce [1999; 2003; 2004; 2005]. Although all these subjectivist Bayesian approaches draw on a belief-strength notion of probability, there remain important differences among them. For instance, all agree that belief strengths should be probabilistically coherent (i.e., consistent with the probability axioms), but some maintain that belief strengths should also satisfy additional epistemic constraints.
In this article I will explicate the probabilistic logic underlying most probabilistic accounts of evidential support. I’ll also discuss some difficulties faced by objectivist and subjectivist readings of confirmation functions, and suggest a conception that I think overcomes the interpretative problems. Here is how I’ll proceed. In section 1 I will discuss the most basic features of this probabilistic logic. I’ll set down axioms that characterize these functions, but without yet describing how the logic is supposed to apply to the evaluation of scientific hypotheses. Section 2 will briefly describe two accounts of the nature of confirmation functions — views about what the confirmation functions are supposed to be, or what they are supposed to represent. We won’t be able to delve too far into this issue until after we see how the logic of confirmation functions represents the evidential support of hypotheses, which I’ll address in section 3. But the logic of confirmation described in section 3 will be more easily comprehended if we have some preliminary idea about what the confirmation functions represent. Section 3 shows how the logic of confirmation functions represents the evidential support of hypotheses. I will spell out several forms of Bayes’ Theorem, showing how the Bayesian formulae represent the role of the likelihoods and another important factor, prior probabilities of hypotheses, in the logic of evidential support. In section 4 I will return to the issue of what confirmation functions are supposed
to be. I’ll describe major problems with the two most prominent interpretations, and suggest an alternative view. In particular I’ll address the issue of how confirmation functions are supposed to inform our beliefs about the truth or falsity of hypotheses. Is there any good reason to think that on a suitable amount of evidence, false hypotheses will become strongly disconfirmed and true hypotheses will become highly confirmed? In section 5 I will explicate a Bayesian Convergence Theorem that provides such assurances. It shows that under reasonable conditions the accumulation of evidence should result in the near refutation of false hypotheses — in confirmational probability approaching 0 — and should lead to a high degree of confirmation for the true alternative — to confirmational probability near 1.3 The discussion throughout sections 3 through 5 will suppose that the likelihoods for evidence statements on the various hypotheses, which express the empirical content of hypotheses, are either objectively determinate or intersubjectively agreed to by the relevant scientific community — so all confirmation functions employed by a scientific community will agree on the numerical values of the likelihoods. In section 6 I will weaken this supposition, but argue that the important truth-acquiring features of probabilistic confirmation theory are nevertheless maintained.

3 This Bayesian Convergence Theorem avoids many of the usual objections to such theorems. It depends only on likelihoods; so those suspicious of prior probabilities may still find it edifying. And this theorem provides explicit bounds on the likely rate of convergence.

1 THE PROBABILISTIC LOGIC OF CONFIRMATION FUNCTIONS
Confirmation functions represent a logic by which scientific theories are refuted or supported by evidence. So they should be defined on an object language rich enough to fully express any scientific theory. A language with the expressive strength of first-order predicate logic should suffice.4 All current scientific theories are expressible in such a language.

4 Everything I’ll say here also applies to languages for second-order and higher-order logics. Some confirmation theorists take confirmation functions to apply directly to “propositions” rather than to sentences of a language, where propositions are supposed to be sets of possible worlds, or some other sort of non-linguistic entities. I prefer to define confirmation functions directly on a language. There are a number of reasons for preferring this approach. Among them are: (1) it avoids the sticky metaphysical issue of what a proposition is (indeed, on many accounts, all logical truths represent the same proposition — so a logic that directly employs such propositions may hide important logical structure); (2) it makes the logic more directly comparable to deductive logics; (3) although there is widespread agreement about how formal languages and their logics work, this may not be true for propositions; (4) all real scientific theories are expressed in language (including the language of set theory and mathematics), so if the results of a propositional treatment of confirmation are to be applicable to real scientific theories, they will have to be translated into results that apply to expressions in a language; (5) to the extent that the propositional approach is richer, and permits the proof of stronger results than the sentential approach, it isn’t at all clear that these stronger results are relevant to the confirmation of real scientific theories, which are always expressed in a language. So, when important results are stated and proved in terms of propositions, one may be left wondering what sort of assumptions about the nature of propositions play a crucial role, and whether these results can really be translated into results about the confirmation of real scientific theories.

Notation for the standard logical connectives
and quantifiers for this language are as follows: “not”, ‘∼’; “and”, ‘·’; “or”, ‘∨’; truth-functional “if-then”, ‘⊃’; “if and only if”, ‘≡’; the quantifiers “all”, ‘∀’, and “some”, ‘∃’; and the identity relation, “is the same thing as”, ‘=’. These are the logical terms of the language. Standard deductive logic depends only on the meanings of these logical terms. The other terms of a language (e.g., names and predicate and relation expressions) are called non-logical terms. The central logical notion of standard deductive logic, the logical entailment relation, neither depends on their meanings nor on the actual truth-values of sentences composed of these terms. It only supposes that the non-logical terms are meaningful, and that sentences containing them must be either true or false.5

5 Any language with this syntax permits the expression of set theory and all of mathematics, so it should be sufficiently rich to express all statements made by real scientific theories. But if you doubt the adequacy of this language for the expression of real scientific theories, then please think of the logic of confirmation presented here as a logician’s model of the real thing. Viewed this way, the logician’s model is, like any model employed by the sciences, an approximation that is intended to capture the essential features of its subject matter, which in this case is the nature of correct scientific inferences concerning hypotheses expressible in a scientific dialect of a natural language such as English.

A degree of confirmation function represents a relationship between statements (i.e. declarative sentences) that is somewhat analogous to the deductive logical entailment relation. However, the logic of confirmation will have to deviate from the deductive paradigm in several significant ways. For one thing, deductive logical entailment is an absolute relationship between sentences, while confirmation comes in degrees. Deductive logical entailment is monotonic: when B logically entails A, adding a premise C cannot undermine that logical entailment — i.e., (C·B) must logically entail A as well. But confirmation is nonmonotonic. Adding a new premise C to B may substantially raise the degree to which A is confirmed, or may substantially lower it, or may leave it unchanged — i.e., for a confirmation function Pα, the value of Pα[A|C·B] may be much larger than Pα[A|B] (for some statements C), while it may be much smaller (for some other C), and it may have the same value, or nearly the same value (for yet other statements C). Another very significant difference is this. A given deductive logic specifies a unique logical entailment relation, and that relation depends only on the meanings of the logical connectives and quantifiers. Is there, similarly, a uniquely good confirmation function? And does it depend only on the meanings of logical terms, and not on the specifics of what the individual sentences mean (due to the meanings of the names and predicates they contain)? Most confirmation theorists would answer “no” to both questions. (I’ll discuss some reasons for this later.) Rather, from the logician’s point of view, confirmation functions are technically somewhat analogous to truth-value assignments to sentences of a formal language. That is, holding the meanings of the logical terms (connectives and quantifiers) fixed, there are lots of ways to assign meanings to names and predicate terms of a language; and for each such meaning assignment, each of the various ways the world might turn out to be assigns a corresponding truth-value assignment to each sentence of the language. Similarly, keeping the meanings of logical terms fixed, each way of
assigning meanings to names and predicate expressions may give rise to a distinct confirmation function; and more than one confirmation function may apply to each fully meaningful language. So there are many possible confirmation functions. In deductive logic the possible truth-value assignments to sentences of a formal language L are constrained by certain semantic rules, which are axioms regarding the meanings of the logical terms (‘not’, ‘and’, ‘or’, etc., the quantifiers, and the identity relation). The rules, or axioms, for confirmation functions play a similar role. They constrain each member of the family of possible confirmation functions, {Pβ, Pγ, . . ., Pδ, . . .}, to respect the meanings of the logical terms, but without regard for what the other terms of the language may mean. Although each confirmation function satisfies the same basic axioms, the further issue of which among them provides an appropriate measure of confirmation is not settled by these axioms alone. It presumably depends on additional factors, including the meanings of the non-logical terms of the language. Here are the semantic rules (or axioms) that constrain probabilistic confirmation functions.6

6 The language L in which scientific hypotheses and evidence claims are expressed is what logicians call the object-language. As logicians see it, confirmation functions are not themselves part of the object-language L. Rather, we take confirmation functions, along with other semantic notions like truth and logical entailment, to reside in the metalanguage, which is the language where properties of object-language expressions are treated. Logicians distinguish object-language from metalanguage in order to avoid certain kinds of paradoxes and logical incoherencies that can arise from applying semantic concepts to themselves.

Let L be a language for predicate logic with identity, and let ‘⊨’ be the standard logical entailment relation (where ‘B ⊨ A’ abbreviates ‘B logically entails A’, and ‘⊨ A’ abbreviates ‘A is a logical truth’). A confirmation function is a function Pα from pairs of sentences of L to real numbers between 0 and 1 that satisfies the following rules:
1. Pα[D|E] < 1 for at least one pair of sentences D and E.
For all sentences A, B, C:
2. If B ⊨ A, then Pα[A|B] = 1;
3. If ⊨ (B ≡ C), then Pα[A|B] = Pα[A|C];
4. If C ⊨ ∼(A·B), then either Pα[(A ∨ B)|C] = Pα[A|C] + Pα[B|C], or for every sentence D, Pα[D|C] = 1;
5. Pα[(A·B)|C] = Pα[A|(B·C)] × Pα[B|C].
This axiomatization takes conditional probability as basic. The conditional probability functions it characterizes are identical to the usual unconditional probability functions when the latter are defined: just let Pα[A] = Pα[A|(D ∨ ∼D)]. But, these axioms permit conditional probabilities Pα[A|C] to remain defined even when a condition statement C has probability 0 (e.g., even when Pα[C|(D ∨ ∼D)] = 0). Though useful, this feature is not the primary reason for taking conditional probability to be basic when specifying confirmation functions.
The primary reason is this. On the usual account, where unconditional probability is basic, conditional probability is defined as follows: P[A|B] = P[(A·B)]/P[B] if P[B] > 0, and is undefined otherwise. (This is closely related to axiom 5.) But if one takes conditional probability as defined in this way, the likelihood a hypothesis H assigns to some evidence statement E under experimental or observation conditions C, P[E|H·C], must be defined as follows: P[E|H·C] = P[E·H·C]/P[H·C]. However, in the context of confirmation functions it seems really unnatural to take such likelihoods as defined like this. Likelihoods often have very well-defined, well-known values all on their own, whereas the values of the probabilities in the numerator and denominator of ‘P[E·H·C]/P[H·C]’ are often unknown or only vaguely specified. For example, let H say “the coin is fair” (i.e. it has a propensity to come up heads half the time when tossed in the usual way), let C say “the coin is tossed at present in the usual way”, and let E say “the coin lands heads on the present toss”. We may not be at all clear about the values of P[E·H·C] (the probability that “the coin is fair, is presently tossed in the usual way, and lands heads”) or of P[H·C] (the probability that “the coin is fair, and is presently tossed in the usual way”). Nevertheless, the value of P[E|H·C] seems to be perfectly well-defined and well-known (it should clearly equal 1/2), due to what H·C says about E.7 Thus, because confirmation functions are supposed to represent such relationships between hypotheses and evidence statements, and because such likelihoods are often better defined and better known than the probabilities that would define them via the usual ratio definition, it seems more natural to axiomatize confirmational probabilities in a way that takes conditional probabilities as basic.8
One important context where the present approach is especially useful is this. Consider a statistical hypothesis that says that the chance (or measure) of an attribute X among systems in a state Y is 1. Formally, we might express such a hypothesis this way: ‘Ch(X, Y) = 1’. Suppose it’s known that a physical system g is in state Y (i.e., g ∈ Y). This gives rise to a likelihood of the following form: Pα[g ∈ X | Ch(X, Y) = 1 · g ∈ Y] = 1. Adding additional information to the premise may decrease this likelihood — e.g., Pα[g ∈ X | Ch(X, Y) = 1 · g ∈ Y · ∼g ∈ X] = 0. The present approach to confirmation functions permits this, whereas on the usual approach to conditional probability this “lowering of the likelihood” cannot happen when the likelihood has value 1.9

7 The idea that confirmation functions should agree on the values of such likelihoods is an additional supposition, not captured by the rules (axioms) for confirmational probabilities given above. I’ll say more about the nature of these direct inference likelihoods a bit later.
8 See [Hájek, 2003a] for more reasons to take conditional probability as basic. Although an account of confirmation functions that takes conditional probability as basic makes good conceptual sense, the account of Bayesian confirmation in this article will not rely on this approach in any crucial way. The more usual axiomatization will suffice for most purposes.
It goes like this: a confirmation function is a function Pα from sentences of L to real numbers between 0 and 1 that satisfies the following rules: (1) if ⊨ A, then Pα[A] = 1; (2) if ⊨ ∼(A·B), then Pα[A ∨ B] = Pα[A] + Pα[B]; by definition Pα[A|B] = Pα[A·B]/Pα[B] whenever Pα[B] > 0.
9 For, if by definition Pα[g ∈ X | Ch(X, Y) = r · g ∈ Y] = Pα[g ∈ X · Ch(X, Y) = r · g ∈ Y]/Pα[Ch(X, Y) = r · g ∈ Y], then whenever Pα[g ∈ X | Ch(X, Y) = 1 · g ∈ Y] = 1 we must have
Let us now briefly consider each axiom, to see what sort of constraint it places on a measure of confirmation. First, notice that adopting a scale between 0 and 1 is merely a convenience. This scale is usual for probabilities, but another scale might do as well. Rule (1) is a non-triviality requirement. It says that some sentences must be supported by others to degree less than 1. We might instead have required that Pα[(A·∼A)|(A ∨ ∼A)] < 1; but this turns out to be derivable from Rule (1) together with the other rules. Rule (2) says that if B logically entails A, then B must maximally confirm A. This makes each probabilistic confirmation function a kind of generalization of the deductive logical entailment relation. Rule (3) adds the plausible constraint that whenever statements B and C are logically equivalent, each must provide precisely the same confirmational support to all other statements. Rule (4) says that confirmational support adds up in a plausible way. When C logically entails the incompatibility of A and B, the support C provides each separately must sum to the support it provides for their disjunction. The only exception is the case where C acts like a contradiction and supports every sentence to degree 1. To understand what Rule (5) says, think of a confirmation function Pα as describing a measure of the proportion of possible states of affairs (or possible worlds) in which statements are true: an expression of form ‘Pα[C|D] = r’ says that the proportion of states in which C is true among those where D is true is r. On this reading, Rule (5) says the following: suppose that B (together with C) is true in proportion q of all states where C is true; and suppose that A is true in fraction r of all those states where B and C are true together; then A and B (and C) will be true in fraction r of the proportion q (i.e. in r × q) of all states where C is true. All of the usual theorems of probability theory are easily derived from these axioms. For example, logically equivalent sentences are always supported to the same degree: if C ⊨ (B ≡ A), then Pα[A|C] = Pα[B|C]. And the following generalization of the Addition Rule (4) is derivable: Pα[(A ∨ B)|C] = Pα[A|C] + Pα[B|C] − Pα[(A·B)|C]. It also follows that if {B1, . . ., Bn, . . .} is any countable set of sentences that are mutually exclusive, given C (i.e., for each pair Bi and Bj, C ⊨ ∼(Bi·Bj)), then limn Pα[(B1 ∨ B2 ∨ . . . ∨ Bn)|C] = Σ_{i=1}^{∞} Pα[Bi|C] (unless Pα[D|C] = 1 for every sentence D).10
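On the "proportion of possible states" reading just sketched, the rules can be checked mechanically on a toy model; the following sketch (ours, with a uniform measure over the four truth assignments to two atoms) verifies the derived Addition Rule and Rule (5):

```python
# P[A|B] defined as the proportion of B-states that are also A-states,
# over the four truth assignments to two atomic sentences A and B.
from itertools import product
from fractions import Fraction as F

states = list(product([True, False], repeat=2))  # (value of A, value of B)

def P(x, given=lambda s: True):
    num = [s for s in states if given(s) and x(s)]
    den = [s for s in states if given(s)]
    return F(len(num), len(den))

A = lambda s: s[0]
B = lambda s: s[1]
AorB = lambda s: A(s) or B(s)
AandB = lambda s: A(s) and B(s)

# Derived general Addition Rule: P[A v B | C] = P[A|C] + P[B|C] - P[A.B|C]
assert P(AorB) == P(A) + P(B) - P(AandB)
# Rule (5): P[A.B|C] = P[A|B.C] x P[B|C], here with a tautological C
assert P(AandB) == P(A, given=B) * P(B)
```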
It then follows ∞ from the derived rule stated above that Pα [∨{Bk :Bk ∈S} Bk |C] = i=1 Pα [Bi |C]. Here is a
In the context of the logic of confirmational support it makes good sense to supplement the above rules with two more. Here’s the first one:
6. If A is an axiom of set theory or any other piece of pure mathematics employed by the sciences, or if A is analytically true (given the meanings that Pα presupposes for the terms in L), then, for all C, Pα[A|C] = 1.
The idea is that the logic of confirmation is about evidential support for contingent claims. Nothing can count as empirical evidence against or for non-contingent
I have no objection to countable additivity in general. But infinite disjunction doesn’t appear to be a natural part of the language in which scientific theories are expressed. That is, although there are well worked out extensions of predicate logic to infinitary languages, which include a means of expressing infinite disjunctions (see [Scott and Krauss, 1966]), no scientific theory I know of draws on the resources of such a language. So I won’t add such devices to the object language for the confirmation functions under consideration here. Regular first-order predicate logic does have a limited means of expressing infinite disjunctions via existential quantification. So one might consider adding a kind of countable additivity axiom (sometimes called the Gaifman Condition, after [Gaifman, 1962]) as follows: for each open expression Fx, Pα [∃xFx|B] = limn P [Fc1 ∨ . . . ∨ Fcn |B], where the individual constants c1 , . . . , cn , . . . , exhaust the countably infinite list of L’s individual constants. From this axiom the following form of countable additivity follows: if for each distinct ci and P cj , B ∼(Fci · Fcj ), then Pα [∃xFx|B] = ∞ i=1 Pα [Fcj |C]. However, the proposed axiom seems overly strong, since it effectively assumes that every individual object gets named, or at least that enough exemplars are named that the probability of the disjunction approaches the probability of the existential claim. This seems implausible in contexts where the number of individuals is uncountably infinite. If we don’t assume that the Gaifman Condition holds, the strongest claim we should want is this: Pα [∃xFx|B] ≥ limn Pα [Fc1 ∨ . . . ∨ Fcn |B]. But the other axioms already imply this, as follows: Pα [∃xFx|B] ≥ Pα [∃xFx|B] × Pα [(Fc1 ∨ ... ∨ Fcn )|B ·∃xFx] = Pα [(Fc1 ∨ ... ∨ Fcn )· ∃xFx|B] = Pα [∃xFx|B · (Fc1 ∨ ... ∨ Fcn )] × Pα [(Fc1 ∨ ... ∨ Fcn )|B] = Pα [(Fc1 ∨ ... ∨ Fcn )|B] (since Pα [∃xFx|B·(Fc1 ∨ ...∨ Fcn )] = 1, because B·(Fc1 ∨ ...∨ Fcn ) ∃xFx). So Pα [∃xFx|B] ≥ limn Pα [Fc1 ∨ . . . ∨ Fcn |B]. If, Pnin addition, for each pair of distinct ci and cj , B ∼(Fc P∞ i ·Fcj ), then Pα [(Fc1 ∨ ... ∨ Fcn )|B] = i=1 Pα [Fcj |C], so limn Pα [Fc1 ∨ . . . ∨ Fcn |B] = i=1 P Pα [Fcj |C]. Thus, if for each pair of distinct ci and cj , B ∼(Fci ·Fcj ), then Pα [∃xFx|B] ≥ ∞ i=1 Pα [Fcj |C]. One more point: Confirmation functions reside in the meta-language, where logical relationships usually reside, rather than in the object-language L, where scientific hypotheses live. So, even though countable additivity may not be a feature of confirmation functions themselves, scientific hypotheses (expressed in the object-language) may employ object-language probability functions defined on sets to represent chance processes in nature. These object-language probability functions may well satisfy a countable additivity rule for countable unions of disjoint sets. The existence of such countably additive object-language probabilities is perfectly compatible with the present approach to confirmational probability.
truths. They should be maximally confirmed relative to each possible statement. An important respect in which the logic of confirmation functions should follow the deductive paradigm is in not presupposing the truth-values of contingent sentences. For, the whole idea of a logic of confirmation is to provide a measure of the extent to which contingent premise sentences indicate the likely truth-values of contingent conclusion sentences. But this idea won’t work properly if the truth-values of some contingent sentences are presupposed by the confirmation function. Such presuppositions may hide significant premises, making the logic of confirmation enthymematic. Thus, for example, no confirmation function Pα should permit a tautological premise to assign degree of confirmation 1 to a contingent claim: Pα[C|B ∨ ∼B] should always be less than 1 when C is contingent. However, it is common practice for probabilistic logicians to sweep provisionally accepted contingent claims under the rug by assigning them probability 1. This saves the trouble of repeatedly writing a given contingent sentence B as a premise, since Pγ[A|B·C] will just equal Pγ[A|C] whenever Pγ[B|C] = 1. Although this device is useful, such functions should be considered mere abbreviations of proper, logically explicit, non-enthymematic, confirmational relationships. Thus, properly speaking, a confirmation function Pα should assign probability 1 to a sentence on every possible premise only if that sentence is either (i) logically true, or (ii) an axiom of set theory or some other piece of pure mathematics employed by the sciences, or (iii) that sentence is analytic according to the meanings of terms in the language presupposed by confirmation function Pα, and so outside the realm of evidential support. Thus, it is natural to adopt the following version of the so-called “axiom of regularity”.
7. If A is not a consequence of set theory or some other piece of pure mathematics employed by the sciences, and is neither a logically nor an analytically true statement (given the meanings of the terms of L as represented in Pα), then Pα[A|∼A] < 1.11
Taken together with axiom 6, this axiom tells us that a confirmation function Pα counts as non-contingently true just those sentences that it assigns probability 1 on every possible premise.12
Some logicist Bayesians have proposed that the logic of confirmation might be made to depend solely on the logical form of sentences, just like deductive logic. The idea is, effectively, to supplement axioms 1–7 with additional axioms that depend only on the logical structures of sentences, and to add enough such axioms to reduce the number of possible confirmation functions to a single unique function. It is now widely agreed that this project cannot be carried out in a plausible way. Perhaps there are additional rules that should be added to 1–7. But it is doubtful

11 It follows from the other axioms that when Pα[A|∼A] < 1, Pα[A|∼A] = 0: for, ∼A ⊨ (A ∨ ∼A), so 1 = Pα[A ∨ ∼A|∼A] = Pα[A|∼A] + Pα[∼A|∼A] = Pα[A|∼A] + 1.
12 Because Pα[A|C] = 1 for all C if and only if Pα[A|∼A] = 1. So, taken together with axiom 7, ‘Pα[A|C] = 1 for all C’ implies that A must be either a consequence of set theory or some other piece of pure mathematics, or it must be logically or analytically true.
Confirmation Theory
343
that such rules can suffice to specify a single, uniquely qualified confirmation function based only on the formal structure of sentences. I’ll say more about why this is doubtful a bit later, after we first see how confirmational probabilities capture the important relationships between hypotheses and evidence. 2
TWO CONCEPTIONS OF CONFIRMATIONAL PROBABILITY
Axioms 1–7 merely place formal constraints on what may properly count as a probabilistic confirmation function. Each function Pα that satisfies these rules may be viewed as a possible way of specifying a confirmation function that respects the meanings of the logical terms, much as each possible truth-value assignment for a language represents a possible way of assigning truth-values to its sentences in a way that respects the semantic rules expressing the meanings of the logical terms ‘not’, ‘and’, ‘or’, etc. The issue of which of the possible truth-value assignments to sentences of a language represents the actual truth or falsehood of its sentences depends on more than this — it depends on the meanings of the non-logical terms and on the state of the actual world. Similarly, the degree to which some sentences actually support others in a fully meaningful language must rely on more than the mere satisfaction of these axioms for confirmation functions. It must at least rely on what the sentences of the language mean, and perhaps on much more besides.

What more? Perhaps an interpretation of probability appropriate to the confirmation functions may shed some light on this issue by filling out our conception of what confirmational probability is really about. I’ll briefly describe two prominent views.13

One reading, a kind of logicist reading of confirmation functions, is to take each function Pα to be associated with a meaning assignment to sentences, and to take these interpreted sentences to represent possible states of affairs, or ways the world might turn out to be, or possible worlds. Then a confirmation measure Pα is effectively a measure on (expressible) sets of logically possible states of affairs. The idea is that, given a fully meaningful language, ‘Pα[A|B] = r’ says that among the states in which B is true, A is true in “proportion” r of them. The relevant sets of possible states will usually contain infinitely many states, so the measure on sets of states supplied by Pα is not “intrinsic to the sets of possibilities themselves”, but rather is imposed by a particular way of measuring (by a particular measure function on) the space of logical possibilities. Thus, each confirmation function imposes its own way of measuring possible states, and generally there may be any number of respectable ways to assign meanings to sentences and measures on possible states. So, when an agent chooses to employ a particular confirmation function (or a particular collection of them), she is effectively choosing a way of measuring possibilities — a way that reflects how she understands the meanings and inferential implications of sentences in her language.

If we read confirmation functions as measures on possible states, then axioms 1–7 turn out to be plausible constraints on such measures. Each axiom is a rule that is automatically satisfied by finite proportions — so these axioms extend the logic of finite proportions to “measures of proportionality” on infinite sets. However, perhaps not every function that satisfies these axioms represents a good way to measure confirmation. That is, these axioms may not be jointly sufficient to pick out “proper” confirmation functions. What further restrictions should confirmation functions satisfy? As we see how the confirmation of scientific hypotheses is supposed to work in a Bayesian confirmation theory, perhaps some plausible additional restrictions will become apparent. So let’s put that question aside for now. In any case, I hope that the basic idea behind this kind of logicist reading of the confirmation functions is clear enough. I invite you to see how well this reading fits with how Bayesian confirmation theory is supposed to work, as explicated in subsequent sections of this article.

Subjectivist Bayesians offer an alternative reading of the confirmation functions. First, they usually take unconditional probabilities as basic, and they take conditional probabilities as defined in terms of them. Furthermore, subjectivist Bayesians take each unconditional probability function Pα to represent the belief strengths or confidence strengths of an ideally rational agent, α. On this understanding ‘Pα[A] = r’ says, “the strength of α’s belief (or confidence) that A is true is r.” Subjectivist Bayesians usually tie such belief strengths to what the agent would be willing to bet on A being true. Roughly, the idea is this. Suppose that an ideally rational agent α is willing to accept a wager that would yield (at least) $u if A turns out to be true and would lose him $1 if A turns out to be false. Then, under reasonable assumptions about his desires for small amounts of money, it can be shown that his belief strength that A is true should be Pα[A] = 1/(u + 1). And it can further be shown that any function Pα that expresses such betting-related belief strengths on all statements in agent α’s language must satisfy the usual axioms for unconditional probabilities.14 Moreover, it can be shown that any function Pβ that satisfies these axioms is a possible rational belief function for some ideally rational agent β. Such relationships between belief strengths and the desirability of outcomes (e.g., gains in money or goods on bets) are at the core of Bayesian decision theory.15 Subjectivist Bayesians usually take confirmational probability to just be this notion of probabilistic belief strength.16

Undoubtedly real agents do believe some claims more strongly than others. And, arguably, the belief strengths of real agents can be measured on a probabilistic scale between 0 and 1, at least approximately. And clearly the confirmational support of evidence for a hypothesis should influence the strength of an agent’s belief that the hypothesis is true. However, there is good reason for caution about taking confirmation functions to be Bayesian belief strength functions, as we will see later. So, perhaps an agent’s confirmation function is not simply identical to his belief function, and perhaps the relationship between confirmation and belief strength is somewhat more complicated than the subjective Bayesian supposes. In any case, some account of what confirmation functions are supposed to represent is clearly needed. The belief function account and the possible states (or possible worlds) account are two attempts to provide this. I’ll now put this interpretative issue aside until later (section 4). We’ll return to trying to get a better handle on what probabilistic confirmation functions really are after we take a careful look at how the logic that draws on them is supposed to work.

13 [Hájek, 2003b] provides a helpful discussion of the various interpretations of probability.
14 Note 8 lists these axioms.
15 An alternative, but conceptually similar approach is to set down intuitively plausible constraints on the notion of rational preference among acts (or their outcomes), and then show that any such notion of preference can be represented (uniquely) by a probabilistic belief function together with a utility function, and that preferred acts (or their outcomes) are just those that maximize expected utility.
16 For various versions of this approach see [Ramsey, 1926; de Finetti, 1937; Savage, 1954; Jeffrey, 1965; Skyrms, 1984; Joyce, 1999]. [Hájek, 2005] provides a perceptive analysis.
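As a small illustration of the betting account just described, the following sketch computes the belief strength implied by a wager of the kind mentioned above. The function name and sample payoffs are mine; the formula Pα[A] = 1/(u + 1) is from the text.

```python
# A sketch of the betting analysis above: accepting a wager that pays $u if
# A is true and loses $1 if A is false commits the agent to belief strength
# P[A] = 1/(u + 1).  The function name and sample payoff are assumptions.

def belief_strength_from_payoff(u):
    """Belief strength at which the wager (win $u on A, lose $1 on ~A) is fair."""
    return 1.0 / (u + 1.0)

p = belief_strength_from_payoff(3.0)
print(p)                          # 0.25
print(p * 3.0 - (1 - p) * 1.0)    # 0.0: expected gain is zero, so the bet is fair
```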
3 THE LOGICAL STRUCTURE OF EVIDENTIAL SUPPORT AND THE ROLE OF BAYES’ THEOREM IN THAT LOGIC
A theory of confirmation should explicate the notion of evidential support for all sorts of scientific hypotheses, ranging from simple diagnostic claims (e.g., the patient has pneumonia) to scientific theories about the fundamental nature of the world, like quantum mechanics or the theory of relativity. We’ll now look at how the logic of probabilistic confirmation functions draws on Bayes’ Theorem to bring evidence to bear, via the likelihoods, on the refutation or support of scientific hypotheses. To begin with, consider some exhaustive set of mutually incompatible hypotheses or theories about some common subject matter, {h1 , h2 , . . .}. The set of alternatives may consist of a simple pair of alternatives — e.g., {“the patient has pneumonia”, “the patient doesn’t have pneumonia”}. Or it may consist of a long list of possible alternatives, as is the case when the physician tries to determine which among a range of diseases is causing the patient’s symptoms. For the cosmologist the alternatives may consist of several alternative theories of the structure and dynamics of space-time, and may include various “versions of the same theory”. Where confirmation theory is concerned, even a slightly different (logically distinct) version of a given theory will count as a distinct theory, especially if it differs from the original in empirical import. In principle there may be finitely or infinitely many alternative hypotheses under consideration. They may all be considered at once, or they may be constructed and assessed over many centuries. One may even take the set of alternative hypotheses to consist of all logically possible alternatives expressible in a given language about a given subject matter — e.g., all possible theories of the origin and evolution of the universe expressible in English and mathematics. Although testing every logically possible alternative poses practical challenges, the logic works much the same way in this logically ideal case as it does in more practical cases. The set of alternative hypotheses may contain a catch-all hypothesis hK that says that none of the other hypotheses are true — e.g., “the patient has none of the known diseases”. When only a finite number u of explicit alternative hypotheses
is under consideration, hK will be equivalent to the sentence that denies each explicitly specified alternative: (∼h1 · . . . · ∼hu).

Evidence for scientific hypotheses comes from specific experiments or observations. For a given experiment or observation, let ‘c’ represent a description of the relevant experimental or observational conditions under which the evidence is obtained, and let ‘e’ represent a description of the evidential outcome that comes about under conditions c. Scientific hypotheses often require the mediation of background knowledge and auxiliary hypotheses to help them express claims about evidence. Let ‘b’ represent all background and auxiliary hypotheses not at issue in the assessment of the hypotheses hi, but that mediate their implications about evidence.

In cases where a hypothesis is deductively related to evidence, either hi·b·c ⊨ e or hi·b·c ⊨ ∼e (i.e., either hi·b·c logically entails e or it logically entails ∼e). For example, hi might be the Newtonian Theory of Gravitation. A test of the theory may involve conditions described by a statement c about how measurements of Jupiter’s position are made at various times; the outcome description e states the results of each position measurement; and the background information (or auxiliary hypotheses) b may state some trustworthy (already well confirmed) theory about the workings and accuracy of the devices used to make the position measurements. If outcome e can be calculated from the theory hi together with b and c, we have that hi·b·c ⊨ e. The so-called hypothetico-deductive account of confirmation says that in such cases, if (c·e) actually occurs, this may be considered good evidence for hi, given b. On the other hand, if from hi·b·c we calculate some outcome incompatible with e, then we have hi·b·c ⊨ ∼e. In that case, from deductive logic alone we get that b·c·e ⊨ ∼hi, and hi is said to be falsified by b·c·e.

Duhem [1906] and Quine [1953] deserve credit for alerting inductive logicians to the importance of auxiliary hypotheses. They point out that scientific hypotheses often make little contact with evidence claims on their own. So, generally speaking, the evidence can only falsify hypotheses relative to background or auxiliary hypotheses that tie them to that evidence. However, the auxiliaries themselves will usually be testable on some separate body of evidence in much the same way that the hypotheses {h1, h2, . . .} are tested. Furthermore, when we are not simply interested in assessing the hypotheses {h1, h2, . . .} relative to a specific package of auxiliaries b, but instead want to consider various alternative packages of auxiliary hypotheses, {b1, b2, . . .}, as well, the set of alternative hypotheses to which the logic of confirmation applies should be the various possible combinations of original hypotheses in conjunction with the possible alternative auxiliaries, {h1·b1, h1·b2, ..., h2·b1, h2·b2, ..., h3·b1, h3·b2, . . .}. When this is the case, the logic of confirmation will remain the same. The only difference is that the hypotheses ‘hi’ under discussion below should be taken to stand for the complex conjunctive hypotheses of form (hj·bv), and ‘b’ in our discussion below should stand for whatever remaining, common auxiliary hypotheses are not at issue. In the most extreme case, where each hypothesis at issue includes within itself all relevant auxiliaries, the term ‘b’ may be empty — i.e. we may take it to be some simple tautology.
In probabilistic confirmation theory the degree to which a hypothesis hi is supported or confirmed on evidence c·e, relative to background b, is represented by the posterior probability of hi, Pα[hi | b·c·e]. It turns out that the posterior probability of a hypothesis depends on two kinds of factors: (1) its prior probability, Pα[hi | b], together with the prior probabilities of its competitors, Pα[hj | b], etc.; and (2) the likelihood of evidential outcomes e according to hi (given that b and c are true), P[e | hi·b·c], together with the likelihoods of the outcomes according to hi’s competitors hj, P[e | hj·b·c], etc. I’ll now examine each of these two kinds of factors more closely. Then I’ll discuss how the values of posterior probabilities depend on them.
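Before turning to each factor, here is a minimal sketch of how the two kinds of factors combine. The hypothesis names and all numerical values are hypothetical; the computation is just Bayes’ Theorem applied to an exhaustive set of mutually incompatible hypotheses.

```python
# Hypothetical numbers throughout.  Three mutually incompatible, jointly
# exhaustive hypotheses, with priors P[hi | b] and likelihoods P[e | hi·b·c].

priors      = {"h1": 0.5, "h2": 0.3, "h3": 0.2}   # prior plausibilities (assumed)
likelihoods = {"h1": 0.8, "h2": 0.1, "h3": 0.4}   # P[e | hi·b·c] (assumed)

# Bayes' Theorem:
# P[hi | b·c·e] = P[e | hi·b·c] P[hi | b] / sum_j P[e | hj·b·c] P[hj | b]
total = sum(likelihoods[h] * priors[h] for h in priors)
posteriors = {h: likelihoods[h] * priors[h] / total for h in priors}

print(posteriors)
# e was likely on h1 and unlikely on h2, so h1's posterior rises above its
# prior while h2's falls: the interplay of the two kinds of factors.
```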
3.1 Likelihoods
Hypotheses express their empirical import via likelihoods, which are confirmation function probabilities of form P[e | hi·b·c].17 A likelihood expresses how likely it is that outcome e will occur according to hypothesis hi under conditions c, supposing that auxiliaries b hold.18 If a hypothesis together with auxiliaries and observation conditions deductively entails an evidence claim, the probability axioms make the corresponding likelihood objective in the sense that every confirmation function must agree on its values: i.e., for all confirmation functions Pα, Pα[e | hi·b·c] = 1 if hi·b·c ⊨ e, and Pα[e | hi·b·c] = 0 if hi·b·c ⊨ ∼e. However, in many cases the hypothesis hi will not be deductively related to the evidence, but will only imply it probabilistically. There are at least two ways this might happen. Either hi may itself be an explicitly probabilistic or statistical hypothesis, or there may be an auxiliary statistical hypothesis in the background b that connects hi to the evidence. For the sake of clarity let’s briefly consider examples of each.

A blood test for HIV has a known false-positive rate and a known true-positive rate. Suppose the false-positive rate is .05 — i.e., the test may be expected to incorrectly show the blood sample to be positive for HIV in about 5% of all cases where no HIV is present. And suppose the true-positive rate is .99 — i.e., the blood test may be expected to correctly show the blood sample to be positive for HIV in about 99% of all cases where HIV really is present. When a particular patient’s blood is tested, the hypotheses under consideration are ‘the patient is infected with HIV’, h, and ‘the patient is not infected with HIV’, ∼h. In this context the known test characteristics play the role of background information, b. The experimental condition c merely states that this patient was subjected to this particular blood test for HIV, which was processed by the lab in the usual way.

17 Presentations of the logic of confirmation often suppress c and b, and simply write ‘P[e|h]’. But c and b are important to the logic of the likelihoods. So I’ll continue to make them explicit.
18 Bayesians often refer to the probability of an evidence statement on a hypothesis, P[e|h·b·c], as the likelihood of the hypothesis. This terminology was introduced by R.A. Fisher [1922], who treated likelihoods as functions on the space of possible hypotheses, which he took to be measures of how strongly the evidence supports hypotheses. This can be a somewhat confusing way of talking, since it is clearly the evidence that is made likely to whatever degree by a hypothesis. So I’ll avoid this usual (but confusing) way of talking about likelihoods.
Let us suppose that the outcome e states that the result is positive for HIV. The relevant likelihoods, then, are P[e | h·b·c] = .99 and P[e | ∼h·b·c] = .05.

In this example the values of the likelihoods are entirely due to the statistical characteristics of the accuracy of the test, which is carried by the background information b. The hypothesis h being tested is not itself statistical. This kind of situation may, of course, arise for much more complex hypotheses. The hypothesis of interest may be some deterministic physical theory, say Newtonian Gravitation Theory. Some of the experiments that test this theory may rely on imprecise measurements that have known statistical error characteristics, which are expressed as part of the background or auxiliary hypotheses b. For example, the auxiliary b may describe the error characteristics of a device that measures the torque imparted to a quartz fiber, used to assess the strength of the gravitational force between test masses. In that case b may say that for this kind of device the measurement errors are normally distributed about whatever value a given gravitational theory predicts, with some specified standard deviation that is characteristic of the device. This results in specific likelihood values, P[e | hi·b·c] = ri, for each of the various alternative gravitational theories hi being tested.

Alternatively, the hypotheses being tested may themselves be statistical in nature. One of the simplest examples of statistical hypotheses and their role in likelihoods consists of hypotheses about the chance-characteristics of coin-tossing. Let h[r] be a hypothesis that says a specific coin has a propensity r for coming up heads on normal tosses, and that all such tosses are probabilistically independent of one another. Let c state that the coin is tossed n times in the usual way; and let e state that on this specific run of n tosses the coin comes up heads m times. In cases like this the value of the likelihood of the outcome e under conditions c on hypothesis h[r] is well known:

P[e | h[r]·b·c] = [n!/(m!(n−m)!)] r^m (1−r)^(n−m).

There are, of course, more complex cases of likelihoods involving statistical hypotheses. Consider the hypothesis that plutonium-233 nuclei have a half-life of 20 minutes — i.e., the propensity for a Pu-233 nucleus to decay within a 20-minute period is 1/2. This hypothesis, h, together with background b about decay products and the efficiency of the equipment used to detect them (which may itself be an auxiliary statistical hypothesis), yields precisely calculable values for likelihoods P[e | h·b·c] of possible outcomes of the experimental arrangement.

Likelihoods that arise from explicit statistical claims — either within the hypotheses being tested, or from statistical background claims that tie the hypotheses to the evidence — are sometimes called direct inference likelihoods. Such likelihoods are generally taken to be completely objective.19 So all reasonable confirmation functions should be required to agree on their values, just as all confirmation functions agree on likelihoods when evidence is logically entailed. In other words, those probability functions Pα that satisfy our axioms but get the direct inference likelihoods wrong should be regarded as failing to represent proper confirmation functions. Direct inference likelihoods may be thought of as logical in an extended, non-deductive sense. Indeed, some logicians have attempted to spell out the logic of direct inferences in terms of the logical form of the sentences involved.20 If that project can be made to work (for Bayesian confirmation functions), then the axioms for probabilistic confirmation functions (in section 1) should be supplemented with additional axioms that capture the logic of the direct inference likelihoods. But regardless of whether that formal project can be made to work, it seems reasonable to take likelihoods that derive from explicit statistical claims to have objective or intersubjectively agreed values, and to disregard any probability function that gets this wrong as failing to represent a proper confirmation function.21

19 If you have doubts about the objectivity of direct inference likelihoods, I refer you to David Lewis’s Questionnaire near the beginning of “A Subjectivist’s Guide to Objective Chance” [1980]. In that paper Lewis argues for the objectivity of direct inference likelihoods based on chance statements. Lewis’s Principal Principle expresses his version of the principle governing these likelihoods. Indeed, Lewis thinks that this kind of principle about direct inference likelihoods captures “all we know about chance.” (p. 266) He takes such likelihoods to be correct, regardless of what the right view might be about the metaphysical basis of chance and of the confirmation probability function (which he calls the credence function). “I shall not attempt to decide between the Humean and the anti-Humean variants of my approach to credence and chance. The Principal Principle doesn’t.” (Final two lines of the paper, p. 292.)
20 Attempts to spell out the logic of direct inference likelihoods for Bayesian confirmation functions in purely formal terms (i.e. in terms of logical structure alone) have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a purely formal account, see the series of papers [Levi, 1977; Kyburg, 1978; Levi, 1978].
21 In several places I’ve drawn on coin tossing as an example of a chancy process, and as having propensities towards heads. I sometimes encounter the following complaint about this treatment: It is reasonable to think that the propensities for single outcomes in macro-physical systems are all either 1 or 0. Only in quantum mechanics do there seem to be non-extreme single-case propensities. So those antecedents of direct inference likelihoods that attribute propensities to macro-systems, like coin tossing systems, are (most likely) literally false. This objection is not just nit-picking. The point is that there may be very few real non-deductive cases of direct inference likelihoods (except in microphysics). It seems to me that two responses are in order. (1) As in the case of deductive logic, the falsity of the premise has nothing to do with the validity of the inference. I take this to be the case for probabilistic direct inferences as well. (2) But perhaps the issue is that if the hypothesis (e.g. about coin propensities, or about any other macro-system) is literally false, and if we know that it’s false, then even though the likelihood based on it may “make sense”, we shouldn’t want to try to confirm such a hypothesis. So there are few instances of true chance claims that figure in direct inference likelihoods in the context of a confirmation theory. This assessment seems correct, up to a point. But I think there is an important caveat. In the sciences we often use “literally false claims” in models that give good approximations. I think we do exactly that when we treat systems that we take to be non-chancy (e.g. coin tossing) as though they are chance processes. We model the system as though it is “truly chancy” (has real non-extreme propensities), and within that framework we test to find which chance model best captures the phenomena at the level of detail we are using to describe it. Understood this way, there is indeed a literally true hypothesis in the neighborhood that we may wish to confirm — i.e. a hypothesis of the sort that says: for coins of this sort (e.g. bent in this particular way), tossed in the usual way, coin tossing is more accurately modeled as though it is a chance mechanism for which the chances of heads is (say) 57/100 than by any other model of a similar sort — e.g. than any other chance model that assigns the chances of heads (according to
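The likelihoods in both sorts of examples are straightforward to compute. The following sketch uses the binomial formula and the HIV test values given above; the function name is mine.

```python
from math import comb

def binomial_likelihood(r, n, m):
    """Direct inference likelihood P[e | h[r]·b·c] for m heads in n tosses:
    [n!/(m!(n-m)!)] r^m (1-r)^(n-m)."""
    return comb(n, m) * r**m * (1 - r)**(n - m)

# Likelihood of e = "7 heads in 10 tosses" on the hypothesis h[0.5]:
print(binomial_likelihood(0.5, 10, 7))    # ~0.117

# The HIV test: a positive outcome e has likelihood .99 on h and .05 on ~h,
# so e favors h over ~h by a likelihood ratio of .99/.05 = 19.8.
print(0.99 / 0.05)
```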
Not all likelihoods of interest in confirmational contexts are warranted deductively or by explicitly stated statistical claims. Nevertheless, the likelihoods that relate hypotheses to evidence in scientific contexts should often have objective or intersubjectively agreed values. So, although we may need to draw on a range of different functions, Pα, Pβ, . . ., Pγ, etc., to represent the confirmation functions employed by various members of a scientific community (due to the different values they assign to prior probabilities), all should agree, at least approximately, on the values of the likelihoods. For, likelihoods represent the empirical content of a hypothesis, what the hypothesis (together with background b) probabilistically implies about the evidence. Indeed, the empirical objectivity of a science relies on a high degree of objectivity or intersubjective agreement among scientists on the values of likelihoods.

To see the point more vividly, imagine what a science would be like if scientists disagreed widely about the values of likelihoods. Each practitioner interprets a theory to say quite different things about how likely it is that various possible evidence statements will turn out to be true. Whereas scientist α takes theory h1 to probabilistically imply that event e is highly likely, his colleague β understands the empirical import of h1 to say that e is very unlikely. And whereas α takes competing theory h2 to probabilistically imply that e is quite unlikely, his colleague β reads h2 as saying that e is very likely. So, for α the outcome e supplies strong support for h1 over h2, because Pα[e | h1·b·c] >> Pα[e | h2·b·c]. But his colleague β takes outcome e to show just the opposite — that h2 is strongly supported over h1 — because Pβ[e | h1·b·c] << Pβ[e | h2·b·c].

General Law of Likelihood: Given any pair of incompatible hypotheses hi and hj, whenever the likelihoods Pα[en | hi·b·cn] and Pα[en | hj·b·cn] are defined, the evidence (cn·en) supports hi over hj, given b, if and only if Pα[en | hi·b·cn] > Pα[en | hj·b·cn]. The ratio of likelihoods Pα[en | hi·b·cn]/Pα[en | hj·b·cn] measures the strength of the evidence for hi over hj given b.27

The Law of Likelihood says that the likelihood ratios represent the total impact of the evidence. Bayesians agree with this, but take prior probabilities to also play a role in the net assessment of confirmation, as represented by the posterior probabilities. So, for Bayesians, even when the strength of the evidence, Pα[en | hi·b·cn]/Pα[en | hj·b·cn], is very high, strongly favoring hi over hj, the net degree of confirmation of hi may be much smaller than that of hj if hi is taken to be much less plausible than hj on grounds not captured by this evidence (where the weight of these additional considerations is represented by the confirmational prior probabilities of hypotheses).

Two features of the way the General Law of Likelihood is stated here need some explanation. As stated, this law does not presuppose that likelihoods of form Pα[en | hj·b·cn] and Pα[en | hi·b·cn] are always defined. This qualification is introduced to accommodate a conception of evidential support called Likelihoodism, which I’ll say more about in a moment. Also, the likelihoods in the law are expressed with the subscript α attached, to indicate that the law holds for each confirmation function Pα, even if the values of the likelihoods are not completely objective or agreed on by a given scientific community. These two features of the law both involve issues concerning the objectivity of the likelihoods.

Each confirmation function (each function that satisfies the axioms of section 1) is defined on every pair of sentences. So, the likelihoods are always defined for a given confirmation function. Thus, for a Bayesian confirmation theory the qualifying clause about the likelihoods being defined is automatically satisfied. Furthermore, for confirmation functions the versions of Bayes’ theorem (Equations 8–11) hold even when the likelihoods are not objective or intersubjectively agreed. When intersubjective agreement on likelihoods may fail, we leave the subscripts α, β, etc. attached to the likelihoods to indicate this possible lack of objective agreement. Even so, the General Law of Likelihood applies to the confirmation function likelihoods taken one confirmation function at a time. For each confirmation function, the impact of the evidence in distinguishing between hypotheses is completely captured by the likelihood ratios.

A view (or family of views) called likelihoodism maintains that confirmation theory should only concern itself with how much the evidence supports one hypothesis over another, and maintains that evidential support should only involve ratios of completely objective likelihoods. When the likelihoods are objective, their ratios provide an objective measure of how strongly the evidence supports hi as compared to hj, one that is “untainted” by such subjective elements as prior plausibility considerations. According to likelihoodists, objective likelihood ratios
are the only scientifically appropriate way to assess what the evidence says about hypotheses.

Likelihoodists need not reject Bayesian confirmation theory altogether. Many are statisticians and logicians who hold that the logical assessment of the evidential impact should be kept separate from other considerations. They often add that the only job of the statistician/logician is to evaluate the objective strength of the evidence. Some concede that the way in which these objective likelihoods should influence the agents’ posterior confidence in the truth of a hypothesis may depend on additional considerations — and that perhaps these considerations may be represented by individual subjective prior probabilities for agents in the way Bayesians suggest. But such considerations go beyond the impact of the evidence. So it’s not the place of the statistician/logician to compute recommended values of posterior probabilities for the scientific community.28

For most pairs of sentences conditional probabilities fail to be objectively defined in a way that suits likelihoodists. So, by their lights the logic of confirmation functions (captured by the axioms of section 1) cannot represent an objective logic of evidential support. Because of this, likelihoodists do not have Bayes’ theorem available (except in special cases where an objective probability measure on the hypothesis space is available), and so cannot extract the Law of Likelihood from it (as do Bayesians via Equations 9–11). Rather, likelihoodists must state the Law of Likelihood as an axiom of their logic of evidential support, an axiom that (for them) applies only when likelihoods have well-defined objective values.

Likelihoodists tend to have a very strict conception of what it takes for likelihoods to be well-defined. They consider a likelihood well-defined only when it has the form of what we referred to earlier as a direct inference likelihood — i.e., only when either (1) the hypothesis (together with background and experimental conditions) logically entails the evidence claim, or (2) the hypothesis (together with background conditions) logically entails an explicit simple statistical hypothesis that (together with experimental conditions) specifies precise probabilities for each type of event that makes up the evidence. Likelihoodists make a point of contrasting simple statistical hypotheses with composite statistical hypotheses, which only entail imprecise, or disjunctive, or directional claims about the statistical probabilities of evidential events. A simple statistical hypothesis might say, for example, “the chance of heads on tosses of the coin is precisely .65”; a composite statistical hypothesis might say, “the chance of heads on tosses is either .65 or .75”, or it may be a directional hypothesis that says, “the chance of heads on tosses is greater than .65”. Likelihoodists maintain that composite hypotheses are not an appropriate basis for well-defined likelihoods, because such hypotheses represent a kind of disjunction of simple statistical hypotheses, and so must depend on non-objective factors — i.e. they must depend on the prior probabilities of the various hypotheses in the disjunction. For example, “the chance of heads on tosses is either .65 or .75” is a disjunction of the two simple statistical hypotheses h.65 and h.75. From the axioms of probability theory it follows that the likelihood of any specific sequence of outcomes e from appropriate tosses c is given by

Pα[e | c·(h.65 ∨ h.75)] = (P[e | c·h.65] × Pα[h.65 | c] + P[e | c·h.75] × Pα[h.75 | c]) / (Pα[h.65 | c] + Pα[h.75 | c]),

where only the likelihoods based on simple hypotheses (those from which I have dropped the ‘α’) are completely objective. Thus, likelihoods based on disjunctive hypotheses depend (at least implicitly) on the prior probabilities of the simple statistical hypotheses involved; and likelihoodists consider such factors to be too subjective to be permitted a role in a logic that is supposed to represent only the impact of the evidence. Taking all of this into account, the version of the Law of Likelihood appropriate to likelihoodists may be stated as follows.

Special Law of Likelihood: Given a pair of incompatible hypotheses hi and hj that imply statistical models regarding outcomes en given b·cn, the likelihoods P[en | hj·b·cn] and P[en | hi·b·cn] are well defined. For such likelihoods, the evidence (cn·en) supports hi over hj, given b, if and only if P[en | hi·b·cn] > P[en | hj·b·cn]; the ratio of likelihoods P[en | hi·b·cn]/P[en | hj·b·cn] measures the strength of the evidence for hi over hj given b.

Notice that when either version of the Law of Likelihood holds, the absolute size of any particular likelihood is irrelevant to the strength of the evidence. All that matters is the relative size of the likelihoods — i.e., the size of their ratio. Here is a way to see the point. Let c1 and c2 be the conditions for two different experiments having outcomes e1 and e2, respectively. Suppose that e1 is 1000 times more likely on hypothesis hi (given b·c1) than is e2 on hi (given b·c2); and suppose that e1 is also 1000 times more likely on hj (given b·c1) than is e2 on hj (given b·c2) — i.e., suppose that Pα[e1 | hi·b·c1] = 1000 × Pα[e2 | hi·b·c2], and Pα[e1 | hj·b·c1] = 1000 × Pα[e2 | hj·b·c2]. Which piece of evidence, (c1·e1) or (c2·e2), is stronger evidence with regard to the comparison of hi to hj? The Law of Likelihood implies both are equally strong. All that matters evidentially are the ratios of the likelihoods, and they are the same in this case: Pα[e1 | hi·b·c1]/Pα[e1 | hj·b·c1] = Pα[e2 | hi·b·c2]/Pα[e2 | hj·b·c2]. Thus, the General Law of Likelihood implies the following principle.

General Likelihood Principle: Suppose two different experiments or observations (or two sequences of them) c1 and c2 produce outcomes e1 and e2, respectively. Let {h1, h2, . . . } be any set of alternative hypotheses. If there is a constant r such that for each hypothesis hj from the set, Pα[e1 | hj·b·c1] = r × Pα[e2 | hj·b·c2], then the evidential import of (c1·e1) for distinguishing among hypotheses in the set (given b) is precisely the same as the evidential import of (c2·e2).

Similarly, the Special Law of Likelihood implies a corresponding Special Likelihood Principle that applies only to hypotheses that express simple statistical models.29

Bayesians agree with likelihoodists that likelihood ratios completely characterize the extent to which the evidence favors one hypothesis over another (as shown by Equations 9–11). So they agree with the letter of the Law of Likelihood and the Likelihood Principle. Furthermore, Bayesian confirmationists may agree that it’s important to keep likelihoods separate from other factors, such as prior probabilities, in scientific reports about the evidence. However, Bayesians go further than most likelihoodists in finding a legitimate role for prior plausibility assessments to play in the full evaluation of scientific hypotheses. They propose to combine a measure of the impact of evidence (couched in terms of ratios of likelihoods) with a measure of the plausibility of hypotheses based on all other relevant factors (couched in terms of ratios of prior probabilities) to yield a probabilistic measure of the net confirmation of each hypothesis (its posterior probability).

Throughout the remainder of this article I will not assume that likelihoods must be based on simple statistical hypotheses, as likelihoodists would have them. However, most of what will be said about likelihoods, including the convergence results in section 5 (which only involve likelihoods), applies to the likelihoodist conception of likelihoods as well. We’ll continue for now to take the likelihoods with which we are dealing to be objective in the sense that all members of the scientific community agree on their numerical values. In section 6 we’ll see how to extend this approach to cases where the likelihoods are less objectively determinate.

27 The General Law is sometimes presented (by likelihoodists) in a stronger form — a form that adds that nothing but likelihood ratios is relevant to the scientific evaluation of hypotheses. In that form it is an explicitly anti-Bayesian thesis.
28 Royall [1997] expresses a view of this sort.
29 The Law of Likelihood and the Likelihood Principle have been formulated in somewhat different ways by various logicians and statisticians. R.A. Fisher [1922] argued for the Likelihood Principle early in the 20th century, though he didn’t call it that. One of the first places it is discussed under that name is in [Savage et al., 1962]. The Law of Likelihood was first called by that name by Hacking [1965], and has been invoked more recently by the likelihoodist statisticians A.F.W. Edwards [1972] and R. Royall [1997].
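The following sketch illustrates the likelihoodist worry numerically. It computes the objective likelihoods of a sample outcome on the simple hypotheses h.65 and h.75, and then the likelihood on the composite hypothesis (h.65 ∨ h.75) via the mixture formula above. The evidence and the prior values are my own assumptions; the point is that the composite likelihood shifts as the priors shift.

```python
from math import comb

def simple_likelihood(r, n, m):
    """Objective likelihood P[e | c·h_r]: m heads in n independent tosses."""
    return comb(n, m) * r**m * (1 - r)**(n - m)

n, m = 20, 14                 # assumed evidence e: 14 heads in 20 tosses
p65 = simple_likelihood(0.65, n, m)
p75 = simple_likelihood(0.75, n, m)

def composite_likelihood(prior65, prior75):
    """P[e | c·(h.65 v h.75)]: the prior-weighted mixture from the text."""
    return (p65 * prior65 + p75 * prior75) / (prior65 + prior75)

print(composite_likelihood(0.5, 0.5))   # equal priors
print(composite_likelihood(0.9, 0.1))   # priors favoring h.65
# The simple likelihoods p65 and p75 are fixed, but the composite likelihood
# moves with the priors, which is the subjectivity likelihoodists object to.
```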
3.5 The Representation of Vague and/or Diverse Prior Probabilities
Given that a scientific community should largely agree on the values of the likelihoods for hypotheses, any significant disagreement regarding the posterior probabilities of hypotheses should derive from disagreements over prior probabilities. The community is diverse with respect to the range of confirmational probability functions they employ. Furthermore, individual agents may not be able to specify precisely how plausible they consider a hypothesis to be; so their prior probabilities for hypotheses may be vague. Both diversity due to disagreement among agents and vagueness for each individual agent can be represented by sets of confirmation functions, {Pβ, Pδ, . . . }, that agree on the likelihoods, but encompass a range of values for the prior plausibilities of hypotheses. Diversity and vagueness are different issues, but they may be represented in much the same way. We consider each in turn.
An individual’s assessments of the non-evidential plausibilities of hypotheses will often be vague — not subject to the kind of precise quantitative treatment that a probabilistic logic of confirmation may seem to require for prior probabilities. So it is sometimes objected that the kind of assessment of prior probabilities required to get the Bayesian appraisal going cannot be had in practice. Bayesian confirmation theory has a way of addressing this worry. An agent’s vague assessments of prior plausibilities may be represented by a collection of probability functions, a vagueness set, which covers the range of plausibility values that the agent finds acceptable. Notice that if accumulating evidence drives the likelihood ratios to extremes, the range of functions in the agent’s vagueness set will be driven to near agreement, near 0 or 1, on values for posterior probabilities of hypotheses. Thus, as evidence accumulates, the agent’s vague initial plausibility assessments may transform into quite sharp posterior probabilities that indicate the strong refutation or support of the various hypotheses. Intuitively this seems like quite a reasonable way for the logic to work.

The various agents in a community may widely disagree over the non-evidential plausibilities of hypotheses. Bayesian confirmation theory may represent this kind of diversity across the community of agents as a collection of all functions in the agents’ vagueness sets. Let’s call such a collection a diversity set. So, while there may well be disagreement among agents regarding the prior plausibilities of hypotheses, and while individual agents may only have vague priors, the logic of probabilistic confirmation may readily represent this feature. Furthermore, if accumulating evidence drives the likelihood ratios to extremes, the range of functions in a diversity set will come to near agreement on sharp values, near 0 or 1, for posterior probabilities of hypotheses. So, not only can such evidence firm up each agent’s vague initial plausibilities, it also brings the whole community into agreement on the near refutation of some alternative hypotheses and on the strong support of others.

Under what conditions might the likelihood ratios go to such extremes as evidence accumulates, effectively washing out vagueness and diversity? The Likelihood Ratio Convergence Theorem (discussed in detail in section 5) implies that if a true hypothesis disagrees with false alternatives on the likelihoods of possible outcomes for a long enough (or a strong enough) stream of experiments or observations, then that evidence stream will very probably produce actual outcomes that drive the likelihood ratios of false alternatives as compared to the true hypothesis to approach 0. As this happens, almost any range of prior plausibility assessments will be driven to agreement on the posterior plausibilities for hypotheses. Thus, the accumulating evidence will very probably bring all confirmation functions in the vagueness and diversity sets for a community of agents to near agreement on posterior plausibility values — near 0 for the false competitors, and near 1 for the true hypothesis.

One more point about prior probabilities and Bayesian convergence is worth noting. Some subjectivist versions of Bayesianism seem to suggest that an agent’s prior plausibility assessments for hypotheses should stay fixed once and for all, and that the only plausibility updating should be brought about via the likelihoods in accord with Bayes’ Theorem. Critics argue that this is unreasonable. The members of a scientific community may quite legitimately revise their “prior”, non-evidential plausibility assessments for hypotheses from time to time as they rethink plausibility arguments and bring new considerations to bear. This seems a natural part of the conceptual development of a science. It turns out that such reassessments of priors pose no difficulty for Bayesian confirmation theory. Reassessments may sometimes come about by the addition of explicit statements that supplement or modify the background information b. Or they may take the form of (non-Bayesian) transitions to new vagueness sets for individual agents and to new diversity sets for the community. The logic of Bayesian confirmation theory places no restrictions on how values for prior plausibility assessments might change. Provided that the series of reassessments of prior plausibilities doesn’t push the non-evidential plausibility assessments of the true hypothesis ever nearer to zero, the Likelihood Ratio Convergence Theorem implies that the evidence will very probably bring the posterior probabilities of the empirically distinct rivals to approach 0 via decreasing likelihood ratios; and as this happens, the posterior probability of the true hypothesis will head towards 1.
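A small simulation can make the washing-out of priors vivid. In the following sketch (my own construction, with assumed priors and a coin-tossing model), two agents begin with sharply different prior probabilities over two simple coin hypotheses; evidence generated by the true hypothesis drives both posterior probabilities toward agreement.

```python
import random
random.seed(0)

true_r = 0.7                       # data generated by the true hypothesis h[0.7]
hypotheses = [0.5, 0.7]            # two simple coin hypotheses
agents = {"alpha": [0.99, 0.01],   # assumed, sharply different priors
          "beta":  [0.10, 0.90]}

for _ in range(500):               # 500 tosses
    heads = random.random() < true_r
    for name, prior in agents.items():
        # conditionalize: weight each hypothesis by the likelihood of the outcome
        posts = [p * (r if heads else 1 - r) for p, r in zip(prior, hypotheses)]
        total = sum(posts)
        agents[name] = [p / total for p in posts]

for name, posterior in agents.items():
    print(name, "posterior for h[0.7] =", round(posterior[1], 6))
# Both posteriors end up near 1: the likelihood ratios wash out the
# difference between the two agents' priors.
```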
4 WHAT IS CONFIRMATIONAL PROBABILITY ANYWAY?
If confirmation functions aren’t some sort of normative guide to what we may legitimately believe, then they are useless, and probabilistic confirmation theory is a pointless enterprise. Ideally a confirmation function should be a kind of truth-indicating index. When things are working right, evidence should induce a confirmation function to indicate the falsehood of false hypotheses by tagging them with confirmational probability numbers near 0, and it should eventually induce the confirmation function to indicate a true hypothesis by tagging it with confirmation numbers that approach 1. Provided a confirmation function has this truth-indicating feature, it makes good epistemic sense for degree of confirmation to influence belief strength. But exactly how is this supposed to work? Precisely how is the degree of confirmation of a hypothesis supposed to hook up with one’s level of confidence or degree of belief in its truth or falsehood? Views about the nature of confirmation functions, about what they really are, should be sensitive to this question. A theory of confirmation that cannot reasonably tie confirmation to appropriate belief provides only a useless contrivance.
4.1 Against Syntactic-Structural Versions of a Logical Interpretation: Grue-Hypotheses
Some Bayesian logicists maintain that confirmation is logical in the same way that deductive logic is logical, and that it should play an analogous role in informing the beliefs of agents. Without going into a lot of technical detail, let’s get a basic handle on what these logical confirmation functions are supposed to be like.
A leading idea is that the posterior probabilities of hypotheses should be determined by logical structure alone. The idea of basing probabilities on syntactic structure may seem plausible enough in the case of likelihoods that deductively or statistically relate hypotheses to the evidence. So, if logical form could also be made to determine the values of prior probabilities, then the logic of confirmation would be fully formal in the same way that deductive logical entailment is formal — i.e., it would be based only on the syntactic structure of the sentences involved. Such confirmation functions would be logical probabilities in the sense that their values would be uniquely specified by the syntactic structures of the sentences of the language. A sufficiently rigorous version of this approach would specify a uniquely best way of assigning logically appropriate priors to hypotheses, resulting in a single uniquely best logical confirmation function. This confirmation function would be completely objective in that it would not be influenced by anyone’s subjective assessments of the plausibilities of various hypotheses. Keynes and Carnap each tried to implement this kind of approach through syntactic versions of the principle of indifference. The idea is that hypotheses that share the same syntactic structure should be assigned the same prior probability values. Carnap showed how to carry out this project in detail, but only for extremely simple formal languages. Most logicians now take this project to have failed because of a fatal flaw in the whole idea that reasonable prior probabilities can be made to depend on logical form alone. Semantic content should matter.

Goodmanian grue-predicates provide one way to illustrate this point.30 Call an object grue at a given time just in case “either the time is earlier than the first second of the year 2030 and the object is green, or the time is not earlier than the first second of 2030 and the object is blue”. Now the statement ‘All emeralds are grue (at all times)’ has the same syntactic structure as ‘All emeralds are green (at all times)’. So, if syntactic structure determines priors, then these two hypotheses should have the same prior probabilities. Indeed, both should have prior probabilities approaching 0. For, there are an infinite number of competitors of these two hypotheses, each sharing the same syntactic structure: consider the hypotheses ‘All emeralds are gruen (at all times)’, where an object is gruen at a given time just in case “either the time is earlier than the first second of the nth day after January 1, 2030, and the object is green, or the time is not earlier than the first second of the nth day after January 1, 2030, and the object is blue.” A purely syntactic specification of the priors should assign all of these hypotheses the same prior probability. But these are mutually exclusive hypotheses; so their prior probabilities must sum to a value no greater than 1. The only way this can happen is for ‘All emeralds are green’ and each of its gruen competitors to have prior probability values equal to 0. In that case the green hypothesis can never receive a posterior probability above 0.

One might object that the predicate ‘grue’ is defined in terms of ‘green’, and so hides the extra syntactic complexity. But from a purely formal, syntactic point of view (which is all this view is entitled to), the predicates we happen to actually employ are only an accident of the language we happen to speak. We could have spoken the grue-language, where ‘grue’ is the more primitive predicate, and where the predicate ‘green’ is defined, and hides the extra syntactic complexity. Here’s how to spell out this point in detail. Suppose the grue-language also contains a predicate ‘bleen’ which, translated into our usual language, works like this: an object is bleen at a given time just in case “either the time is earlier than the first second of the year 2030 and the object is blue, or the time is not earlier than the first second of 2030 and the object is green”. Now, it is easy to show that from the perspective of the grue-language our predicate ‘green’ is defined as follows: an object is green at a given time just in case “either the time is earlier than the first second of the year 2030 and the object is grue, or the time is not earlier than the first second of 2030 and the object is bleen”. (‘Blue’ may be similarly defined.) The point is that from a purely logical perspective there is no reason to prefer one set of primitive predicates over another.

Presumably part of the mission of confirmation theory is to discover what hypotheses, couched in terms of what primitive predicates, best describe the world. The syntactic-structural view attempts to avoid prejudicing the confirmatory process by assigning prior probabilities in a logically/syntactically unbiased way. The grue example shows that this can’t work. If you pick a preferred set of predicates, you build in a bias. If you don’t pick a preferred set, then all of the grue-like hypotheses must be given equal footing with the green hypothesis. But then all prior probabilities must be 0 (or so close to 0 that no significant amount of confirmation can occur).

Even if some version of the syntactic-structural approach could be made to work, its advocates would still owe us an account of how, and why, such confirmation functions should inform our belief strengths for various hypotheses. In particular, for cases where the evidence is not yet sufficient to strongly favor one specific hypothesis over an alternative (i.e. where the likelihood ratio is near 1), why should an agent’s belief strength (or level of confidence) be governed by the syntactic structure of these hypotheses, rather than by their (semantic) meanings together with whatever plausibility considerations make the most sense to the scientific community? The defenders of the syntactic-structural view owe us credible reasons to conform belief to their confirmation functions.

30 Goodman [1955] introduced predicates of the following sort as a challenge to inductive logic. However, the details of my example and the use to which I’ll put it differ from Goodman’s.
4.2 Against the Subjective Belief-Function Interpretation: the Problem of Old Evidence
The subjectivist or personalist Bayesian view solves the problem of how confirmation is supposed to influence belief in the most direct way possible. It says that the agent’s confirmation function Pα should just be his belief function, Belα, where Belα is a probability function that measures how confident the agent is (or should be) in the truth of various statements. Belief is, of course, dynamic. We learn new truths, including evidence claims. On the subjectivist account, upon learning new evidence e, an agent α is supposed to update his belief strengths
via Bayesian conditionalization: for all sentences S (including the hypotheses hj), Belα:new[S] = Belα:old[S | e]. This is where Bayes’ Theorem comes in. When sentence S is a hypothesis hi, we have (from combining Equations 10 and 11, but suppressing ‘c’ and ‘b’, as subjectivists usually do):

Belα:new[hi] = Belα:old[hi | e] = 1 / (1 + ∑j≠i (Belα:old[e | hj] / Belα:old[e | hi]) × (Belα:old[hj] / Belα:old[hi]))
(where the catch-all hypothesis, if needed, is included among the hypotheses hj ). This is how subjectivist Bayesians employ Bayes’ Theorem to represent the updating of belief strengths on new evidence. Formally this account works just fine. However there are reasons for thinking that confirmation functions must be distinct from subjectivist or personalist degree-of-belief functions. One such reason is the problem of old evidence.31 To understand the problem we need to first consider more carefully what belief functions are supposed to represent. Bayesian belief functions are supposed to provide an idealized model of belief strengths for agents. They extend the notion of ideally consistent belief to a probabilistic notion of ideally coherent belief strengths. I have no objection to this kind of idealization as a normative guide for real decision making. An agent is supposed to make decisions based on her belief strengths about the state of the world, her belief strengths about possible consequences of her actions, and her assessment of the desirability (or utility) of these consequences. But the very role that belief functions are supposed to play in decision making makes them ill-suited to hypothesis confirmation, where the likelihoods are often supposed to be objective, or at least possess intersubjectively agreed values that represent the empirical import of hypotheses. That is, for the purposes of decision making, degree-of-belief functions should represent the agent’s belief strengths based on everything she presently knows. But then the degree-of-belief likelihoods must represent how strongly the agent would believe the evidence if a hypothesis hi were added to everything else she presently knows. This makes them quite different than confirmation function likelihoods, which represent what the hypothesis (together with explicit background and experimental conditions) says or implies about the evidence. In particular, degree-of-belief likelihoods are saddled with a version of the problem of old evidence, a problem not shared by confirmation function likelihoods. Here is the problem. An evidence statement e may be well-known far in advance of the time when we first attempt to account for it with some new hypothesis or theory. For example, the rate of advance in Mercury’s perihelion was known well before Einstein developed the theory of General Relativity, and then figured out how the theory could account for that phenomenon. If the agent is already certain of an evidence statement e before using e to test a hypothesis, then her belieffunction likelihoods for e must have value 1, so the belief-function likelihood must 31 Glymour (1980) first raised this problem. Eells (1985) extends the problem. For a more extensive version of the following treatment see (Hawthorne 2005).
Confirmation Theory
367
also be 1 on each alternative hypothesis. That is, if Belα:old is α’s belief function and she already knows that e, then Belα:old [e] = 1. It then follows from the axioms of probability theory that Belα:old [e|hi ] = 1 as well, regardless of what hi says, indeed even if hi says that e is very unlikely. This problem runs even deeper. It not only applies to evidence that the agent knows with certainty. It turns out that almost anything the agent learns that could influence how strongly she believes e will also influence the value of her belief-function likelihood for e, because a belief function Belα:old [e|hi ] represents the agent’s belief strength given everything she knows. To see the difficulty with less-than-certain evidence, consider the following example (where I’ll continue to suppress the ‘b’ and ‘c’ terms.) A physician intends to use a treadmill test to find evidence about whether her patient has heart disease, h. She knows from medical studies that there is a 10% false negative rate for this test; so her belief strength for a negative result, e, given heart disease is present, h, is Belα:old [e|h] = .10. Now, her nurse is very professional and is usually unaffected by patients’ test results. So, if asked, the physician would say her belief strength that her nurse will “feel awful about it”, s, “if the test is positive” (i.e. if ∼ e) is around Belα:old [s| ∼ e] = .05. Let us suppose, as seems reasonable, that this belief strength is independent of whether h is in fact true — i.e. Belα:old [s| ∼e · h] = Belα:old [s| ∼e]. The nurse then tells the physician, in a completely convincing way, “if his test comes out positive, I’ll feel just awful about it.” The physician’s new belief function likelihood for a false negative must then become Belα:new [e|h] = Belα:old [e|h · (∼e ⊃ s)] = .69.32 Now, if a negative test result comes back from the lab, what likelihood is the physician supposed to use in her evaluation of the patient’s prospects for having heart disease, her present personal belief-function likelihood, Bel α:new [e|h] = .69, or the “real” false-negative rate likelihood, P [e|h] = .10? The main point is that even the most trivial knowledge of conditional (or disjunctive) claims involving e may completely upset the objective values of likelihoods for an agent’s belief function. And an agent will almost always have some such trivial knowledge. E.g., the physician in the previous example may also learn that if the treadmill test is negative for heart disease, then, (1) the patient’s worried mother will throw a party, (2) the patient’s insurance company won’t cover additional tests, (3) it will be the thirty-seventh negative treadmill test result she has received for a patient this year,. . . , etc. Updating on such conditionals can force the physician’s belief function likelihoods to deviate widely from the evidentially objective, textbook values for likelihoods. More generally, it can be shown that the incorporation into Bel α of almost any kind of evidence for or against the truth of a prospective evidence claim e — even uncertain evidence for e, as may come through Jeffrey updating33 — completely 32 Since Bel α:old [e|h · (∼e ⊃ s)] = Belα:old [∼e ⊃ s|h · e] × Belα:old [e|h]/(Belα:old [∼e⊃s|h · e] × Belα:old [e|h] + Belα:old [∼e⊃ s | h · ∼ e] × Belα:old [∼e|h]) = Belα:old [e|h]/(Belα:old [e|h] + Belα:old [s| ∼e·h] × Belα:old [∼e|h]) = .1/(.1 + (.05)(.9)) = .69. 33 See Jeffrey [1965; 1987; 1992].
undermines the objective or intersubjectively agreed likelihood values that a belief function might have otherwise expressed.34 This should be no surprise. The agent's belief-function likelihoods reflect her total degree of belief in e, based on h together with everything else she knows about e. So the agent's present belief function may capture objective likelihoods for e only if the possibility of the truth of e can be completely isolated from all of the agent's other beliefs. And this will rarely be the case. One Bayesian subjectivist response to this kind of problem is that the belief functions employed in scientific inferences should often be "counterfactual belief functions", which represent what the agent would believe if e were subtracted (in some suitable way) from everything else she knows (see, e.g., [Howson and Urbach, 1993]). However, our example shows that merely subtracting e won't do. One must also subtract all conditional and disjunctive statements containing e. And one must subtract any uncertain evidence for or against e as well. So the counterfactual belief function idea needs a lot of working out if it is to rescue the idea that subjectivist Bayesian belief functions can provide a viable account of the likelihoods employed by the sciences. There is important work for the degree-of-belief notion to do as part of our best formal account of belief and decision. But degree-of-confirmation functions, associated with objective or public likelihoods, do different work. It seems that confirmation functions should help guide changes in belief, but they should not themselves be agents' belief functions. Taking probabilistic confirmation functions to be degree-of-belief functions, even counterfactual ones, forces the degree-of-belief conception into a mold that doesn't suit it given the other work it does. Better to keep these two notions distinct, and connect them with an account of how degree-of-confirmation should inform degree-of-belief.
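To make the arithmetic of the treadmill example fully explicit, here is a minimal computational sketch of the calculation in footnote 32. The function name and its packaging are my own, introduced purely for illustration:

def updated_likelihood(bel_e_given_h, bel_s_given_not_e_and_h):
    # Bel_new[e|h] = Bel_old[e | h·(~e > s)].  Given h·e the conditional
    # (~e > s) holds with certainty; given h·~e its belief strength is
    # Bel_old[s | ~e·h].  Bayes' Theorem then yields the value below.
    bel_not_e_given_h = 1.0 - bel_e_given_h
    return bel_e_given_h / (bel_e_given_h
                            + bel_s_given_not_e_and_h * bel_not_e_given_h)

print(updated_likelihood(0.10, 0.05))   # 0.10/(0.10 + 0.05*0.90) ≈ 0.69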
4.3
How Confirmational Support should influence Belief Strength: the Truth-Index Interpretation
Rather than ask what a confirmation function is, perhaps it's more fruitful to ask what a confirmation function is supposed to do. That is, I want to suggest a kind of functionalist view of the nature of confirmation functions. You might call this the they-are-what-they-do interpretation. But what is a confirmation function designed to do? What is its functional role supposed to be? As I see it, a confirmation function is supposed to be a kind of truth-indicating index. It can be expected to successfully perform this function when things are working right. That is, when things are working right, an increasing stream of evidence will induce a confirmation function to indicate the falsehood of false hypotheses by tagging them with confirmational probability numbers approaching 0, and the evidence stream will eventually induce the confirmation function to indicate the truth of a true hypothesis by tagging it with confirmational probability
34 See [Hawthorne, 2005] for more details.
numbers that approach 1. What does it take for "things to work right"? Although spelling this out is not completely trivial, it's not as daunting as one might think. If, among the alternative hypotheses proposed to account for a given subject matter, we are fortunate enough to think up a hypothesis that happens in fact to be true, and if we find ways to empirically test it against rivals, then all that's needed for success is persistence and not too much bad luck with how the evidence actually turns out. For, according to the Likelihood Ratio Convergence Theorem (section 5), the true hypothesis itself says, via its likelihoods, that a long enough (but finite) stream of observations or experiments is very likely to produce outcomes that will drive the likelihood ratios that compare empirically distinct false competitors to the true hypothesis to approach 0. As this happens, the confirmation index of these competitors, as measured by their posterior probabilities, also approaches 0, and the confirmation index of the true hypothesis (or at least its disjunction with empirically equivalent rivals) will approach 1. One must read this result carefully, however. The result does not imply that whatever hypothesis has index near 1 at a given moment is likely to be the true alternative. The convergence theorem doesn't say that. Rather, the result suggests the pragmatic strategy of continually testing hypotheses, and taking whichever of them has an index near 1 (if there is one) as the best current candidate for being true. The convergence theorem implies that maintaining this strategy of continual testing is very likely to eventually promote the true hypothesis (or its disjunction with empirically indistinguishable rivals) to the status of best current candidate, and maintain it there. Thus, this strategy is very likely to eventually produce the truth for us. But notice, the theorem doesn't imply that we'll ever be in a position to justifiably be certain that our best current candidate is the true alternative. Thus, this eliminative strategy promises to work only if we continue to look for rivals and continue to test the best alternative candidates against them. This strategy shouldn't seem novel or surprising. It's merely a rigorously justified version of scientific common sense. When the kind of empirical evidence that's related to hypotheses via likelihoods is too meager to distinguish between a particular pair of hypotheses, the confirmation index must rely on whatever our most probative "non-evidential" considerations can tell us. We often have good reasons besides the evidence from likelihoods to discount some logically possible alternatives as just too implausible, or as significantly less plausible than some better conceived competitors. Indeed, we always bring some such considerations to bear, at least implicitly. For, given any specific hypothesis, logicians know how to easily cook up numerous alternatives that agree with it on all the evidence gathered thus far. Any reasonable scientist will reject most of these inventions immediately, because they look ad hoc, contrived, or just foolish. Such reasons for rejection appeal neither to purely logical characteristics of these hypotheses, nor to the usual sorts of evidential considerations. Such reasons bring into play plausibility considerations, some purely conceptual, some broadly empirical. I refer to plausibility considerations as broadly empirical when they reflect our experience, but their import cannot be properly captured by
observation conditions c and evidential outcomes e in evidential likelihoods that express what the alternative hypotheses at issue say about such evidence. On a Bayesian account, whatever cannot be represented by likelihoods may only be introduced via the "prior" probabilities. They are the conduit through which any legitimate considerations not expressed by likelihoods may be brought to bear in the net evaluation of scientific hypotheses. This all suggests that the normative connection between confirmation and belief should go something like this: The Belief-Confirmation Alignment Condition: Each agent should bring her belief strengths for hypotheses into alignment with their degrees of confirmation due to all the relevant evidence of which she is aware, where the confirmation function she employs draws on prior probabilities that represent her best estimates of the comparative plausibilities of alternative hypotheses based on all relevant non-evidential considerations of which she is aware. That is, if Pα is her confirmation function, and she is certain of background and auxiliaries b and evidence cn·en, and this is the totality of her evidence that is relevant to hi, then her belief strength Belα should be (or become) Belα[hi] = Pα[hi|b·cn·en]. Furthermore, if she has partial or uncertain evidence that's relevant to hi, then her belief strength should be the weighted sum of the degrees of confirmation of the hypothesis on each possible alternative evidence sequence cn·en, weighted by her belief strengths for each of those possible evidence sequences (and similarly for possible alternative auxiliaries b, if they are uncertain), as follows:35
Belα[hi] = ∑_{b·cn·en} Pα[hi | b·cn·en] × Belα[b·cn·en].
The Alignment Condition may be difficult for real agents to follow precisely. But it should be a normative guide for real agents (much as Bayesian decision theory is supposed to be a normative guide).36 The Alignment Condition merely recommends that a real agent's confidence in scientific hypotheses should conform to the
35 If the agent is certain of some particular bit of evidence ck·ek in the evidence stream, her belief function will assign belief strength 0 to each possible evidence sequence cn·en that fails to contain ck·ek — i.e., Belα[b·cn·en] = 0 for all cn·en that don't contain ck·ek.
36 The idea that Bayesian epistemology should draw on two distinct probability functions in roughly this way was suggested by Carnap (1971). He calls the degree of belief notion 'rational credence', and he calls the degree of confirmation notion 'credibility'. Carnap takes initial credence functions to derive from credibility functions, which themselves are taken to be logical probability functions. Brian Skyrms largely adopts this Carnapian idea in the third edition of Choice and Chance (1986, Ch. 1, Ch. 6, Sects. 7 and 8), though he doesn't identify his version of credibility functions with Carnapian logical probabilities. Skyrms calls the degree of belief notion 'epistemic probability', and calls the degree of confirmation notion 'inductive probability'. More recently Marc Lange [1999] also argues for a two-function Bayesian model. See [Hawthorne, 2005] for more about the alignment of belief with confirmational support.
level indicated by her confirmation function, moderated by how confident she is in the truth of the evidence claims. It shouldn’t be overly difficult for real agents to approximately align belief to confirmation in this way. Furthermore, supposing (as argued earlier) that probabilistic confirmation functions should not just be belief functions, the Alignment Condition shows how probabilistic confirmation can plausibly be made to mesh with the usual Bayesian account of belief and decision. The Alignment Condition is recommended as a norm by the following fact: if the agent comes to strongly believe true evidence statements, then alignment takes advantage of Likelihood Ratio Convergence to very probably bring the agent to strongly doubt false hypotheses and strongly believe true ones. What better recommendation for the formation of belief strengths about scientific hypotheses could one reasonably expect to have?
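The weighted-sum form of the Alignment Condition is straightforward to compute. Here is a minimal sketch, with two possible evidence sequences and with all labels and numerical values chosen as toys of my own, purely for illustration:

def aligned_belief(confirmation, belief_in_sequence):
    # Bel[h_i] = sum over possible evidence sequences s (here b·c^n·e^n)
    # of P_alpha[h_i | s] × Bel[s], per the Alignment Condition.
    return sum(confirmation[s] * belief_in_sequence[s] for s in belief_in_sequence)

P_alpha = {"b·c2·e2": 0.92, "b·c2·e2'": 0.35}   # degrees of confirmation of h_i
Bel     = {"b·c2·e2": 0.80, "b·c2·e2'": 0.20}   # belief strengths in the sequences

print(aligned_belief(P_alpha, Bel))             # 0.92*0.8 + 0.35*0.2 = 0.806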
5
THE LIKELIHOOD RATIO CONVERGENCE THEOREM
The Likelihood Ratio Convergence Theorem shows that when hi is true and hj is empirically distinct from hi on a sequence of experiments and observations cn, then (provided b is also true) it's very likely that a sequence of outcomes en will result that yields a sequence of likelihood ratios P[en | hj·b·cn]/P[en | hi·b·cn] that approaches 0 as the evidence accumulates (i.e., as n increases). The theorem places an explicit bound on the rate of probable convergence. That is, it puts a lower bound that approaches 1 on how likely it is that, when hi is true, some stream of outcomes will occur that yields a likelihood ratio within a specified small region of 0 (counting heavily against the truth of alternative hj). This convergence theorem draws only on likelihoods. Neither the statement of the theorem nor its proof employs prior probabilities of any kind. Likelihoodists and Bayesian confirmationists agree that when the sequence of likelihood ratios P[en | hj·b·cn]/P[en | hi·b·cn] approaches 0 for increasing n, the evidence goes strongly against hj as compared to hi. So even likelihoodists, who eschew the use of prior probabilities, may embrace this result. For Bayesians, the Likelihood Ratio Convergence Theorem has the additional implication that the posterior probabilities of empirically distinct false competitors of a true hypothesis are very likely to converge to 0. That's because whenever the ratios P[en | hj·b·cn]/P[en | hi·b·cn] approach 0 for increasing n, the Ratio Form of Bayes' Theorem, Equation 9, says that the posterior probability of hj will also approach 0. The values of prior probabilities only accelerate or retard this process of convergence. This also implies that all confirmation functions in a collection that constitutes a vagueness set (which represents the range of vagueness in an agent's assessments of the prior plausibilities of hypotheses) will very likely come to near agreement, all coming to agree that the posterior probabilities of false alternatives approach 0.37 And as the posterior probabilities of false competitors approach 0,
37 The same goes for diversity sets, which represent the range of plausibility assessments among members of a scientific community.
the posterior probability of the true hypothesis heads towards 1. The Likelihood Ratio Convergence Theorem avoids or overcomes the usual objections raised against Bayesian convergence results: • The theorem does not employ second-order probabilities — it doesn’t rely on assessing the probability of a probability. The theorem only concerns the probability of particular disjunctive sentences that represent possible sequences of outcomes. • The theorem does not rely on countable additivity (to which some commentators have objected with regard to other convergence results). • The theorem does not require that evidence consist of sequences of outcomes that, according to the hypotheses, are identically distributed (like repeated tosses of a die). The version of the theorem I’ll present does, however, suppose that the evidential outcomes in the sequence of experiments or observations are probabilistically independent of one another given each hypothesis (or at least that the outcomes can be grouped into clusters that are probabilistically independent of one another). Another version of this theorem (not presented here) applies without supposing probabilistic independence. Nevertheless, I will argue that the sort of probabilistic independence that the present version of the theorem draws on should almost always be present in real scientific contexts. • The rate of likely convergence of the likelihood ratios is explicitly calculable from the likelihoods specified by individual hypotheses. These convergence rates depend only on finite sequences of experiments or observations. So this theorem overcomes the often repeated objection that Bayesian convergence results may only apply in the infinite long run (when we’ll all be long dead). • The values of prior probabilities for hypotheses need not be “locked in” permanently for the theorem to apply. Indeed, the theorem itself doesn’t draw on prior probabilities at all. It only employs likelihoods. However, the convergence of the likelihoods leads directly to the convergence of posterior probabilities; and this convergence of posteriors occurs even if agents reassess the non-evidential plausibilities of hypotheses from time to time, and assign new prior probabilities accordingly. This last point needs some explanation. It is sometimes objected that Bayesian convergence results only work when prior probabilities are held fixed — that the theorems fall through if an agent is permitted to change her evidence-independent assessments of prior plausibilities from time to time. Critics point out that real agents may quite legitimately change their assessments of the evidence-independent plausibilities of hypotheses, perhaps due to newly developed plausibility arguments, or due to the reassessment of old ones. A Bayesian confirmation theory has to represent such reassessments as non-Bayesian shifts from one confirmation
function (or from one vagueness or diversity set of confirmation functions) to another. But, critics object, Bayesian convergence theorems always assume that the only dynamic element in the confirmational process is due to the addition of new evidence, brought to bear by the associated likelihoods. This kind of updating of posterior probabilities via Bayes' Theorem is supposed to be the only means of updating available to the Bayesian. Thus, it looks like Bayesian confirmation is severely handicapped as an account of scientific hypothesis evaluation. However, the Likelihood Ratio Convergence Theorem is not subject to this kind of objection. It applies even if agents revise their evidence-independent priors from time to time. The theorem itself only involves the values of likelihoods. Thus, provided that reassessments of prior plausibilities do not push the non-evidential plausibility of the true hypothesis down towards 0 too rapidly, the theorem shows that posterior probabilities of the empirically distinct false competitors of a true hypothesis will very probably approach 0 as evidence increases.38 I raise these points in advance so that the reader may be on the look-out to see that the theorem really does avoid these challenges.39 Let's now turn to the details.40
38 That is, for each confirmation function Pα, the posterior Pα[hj | b·cn·en] must go to 0 if the ratio Pα[hj | b·cn·en]/Pα[hi | b·cn·en] goes to 0; and that will occur if the likelihood ratios P[en | hj·b·cn]/P[en | hi·b·cn] approach 0 and the prior Pα[hi | b] is a bit greater than 0. The Likelihood Ratio Convergence Theorem will show that when hi·b is true, it is very likely that the evidence will indeed be such as to drive the likelihood ratios as near to 0 as you please (given a long enough evidence stream). As that happens, the only way a Bayesian agent can avoid having his confirmation function yield posterior probabilities for hj that approach 0 (as n gets large) is to continually switch among confirmation functions (moving from Pα to Pβ to . . . to Pγ to . . .) in a way that revises the prior probability of the true hypothesis, hi, downward towards 0. And even then he can only avoid having the posterior probability for alternative hj approach 0 by continually switching to new confirmation functions at a rate that keeps the new priors for hi diminishing towards 0 at least as quickly as the likelihood ratios that disfavor hj (as compared to hi) diminish towards 0. To see this, suppose, to the contrary, that P[en | hj·b·cn]/P[en | hi·b·cn] approaches 0 faster than does the sequence Pγ[hi | b], for changing Pγ and increasing n — i.e. approaches 0 faster in the sense that (P[en | hj·b·cn]/P[en | hi·b·cn])/Pγ[hi | b] goes to 0, for changing Pγ and increasing n. Then we'd have (P[en | hj·b·cn]/P[en | hi·b·cn])/Pγ[hi | b] > (P[en | hj·b·cn]/P[en | hi·b·cn])·(Pγ[hj | b]/Pγ[hi | b]) = Pγ[hj | b·cn·en]/Pγ[hi | b·cn·en]. So the ratios of posterior probabilities Pγ[hj | b·cn·en]/Pγ[hi | b·cn·en] must still go to 0, for changing Pγ and increasing n; and thus, so must Pγ[hj | b·cn·en].
39 My version of the theorem is related to L. J. Savage's (1954) convergence theorem, but generalizes that result considerably. In particular, Savage's theorem supposes that the outcomes are both independent and identically distributed, whereas the Likelihood Ratio Convergence Theorem (hereafter 'LRCT') only supposes independent outcomes, not identical distribution.
Savage’s version does not provide bounds on the rate of convergence, while the LRCT will provide explicit lower bounds. And Savage’s theorem is stated in terms of the convergence of posterior probabilities to 0 or 1 for each of a pair of alternative hypotheses, whereas the LRCT deals directly with likelihoods and likelihood ratios, and does not involve prior probabilities. 40 For a nice presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses see [Earman, 1992, Ch. 6]. Earman was not aware of the Likelihood Ratio Convergence Theorem I’ll be presenting here. Among the convergence results discussed by Earman is a well-known result due to Gaifman and Snir [1982] (hereafter ‘G&S’). The G&S theorem result is a strong law of large numbers result, and so may at first blush appear to be a stronger result than the Likelihood Ratio Convergence Theorem, which is a weak law of large numbers result. However, in important respects the LRCT
5.1
The Space of Possible Outcomes of Experimental and Observational Conditions
To spell out the details of the Likelihood Ratio Convergence Theorem we'll need a few additional notational conventions and definitions. Here they are. For a sequence of n experiments or observations cn, consider the set of those possible sequences of outcomes that would result in likelihood ratios for hj over hi that are less than some chosen small number ε > 0. This set is represented by the following expression:
{en : P[en | hj·b·cn]/P[en | hi·b·cn] < ε}
One may choose any small value of ε that seems interesting, and then form the corresponding set. Placing the disjunction symbol '∨' in front of this expression yields an expression, ∨{en : P[en | hj·b·cn]/P[en | hi·b·cn] < ε}, that represents the disjunction of all outcome sequences in this set. So the expression '∨{en : P[en | hj·b·cn]/P[en | hi·b·cn] < ε}' just represents a particular sentence that says, in effect, "one of those sequences of outcomes from the first n experiments or observations will occur that makes the likelihood ratio for hj over hi less than ε." The Likelihood Ratio Convergence Theorem says that, for any specific ε you choose, the likelihood of a disjunctive sentence of this sort, given that 'hi·b·cn' is true,
P[∨{en : P[en | hj·b·cn]/P[en | hi·b·cn] < ε} | hi·b·cn]
must have a value of at least 1 − (ψ/n), for some explicitly calculable term ψ. And clearly this lower bound, 1 − (ψ/n), will approach 1 as n increases. Thus,
the true hypothesis hi implies that as the amount of evidence, n, increases, it is highly likely (as close to 1 as you please) that one of the outcome sequences en will occur that yields a likelihood ratio P[en | hj·b·cn]/P[en | hi·b·cn] less than ε, for any value of ε you may choose. As this happens, the posterior probability of hi's false competitor, hj, must approach 0, as required by the Ratio Form of Bayes' Theorem. The term ψ in the theorem depends on a measure of the empirical distinctness of the two hypotheses for the proposed sequence of experiments and observations cn. To specify this measure we need to contemplate not only the actual outcomes, but the collection of alternative possible outcomes of each experiment or observation. So, consider some sequence of experimental or observation conditions described by sentences c1, c2, . . ., cn. Corresponding to each condition ck there will be some range of possible alternative outcomes; let Ok = {ok1, ok2, . . ., okw} be a set of statements describing the alternative possible outcomes for condition ck. (The number of alternative outcomes will usually differ for distinct experiments c1, . . ., cn; so the value of w depends on each specific ck.) For each hypothesis hj, the alternative outcomes of ck in Ok are mutually exclusive and exhaustive — that is, for u ≠ v we have:
P[oku·okv | hj·b·ck] = 0 and ∑_{u=1}^{w} P[oku | hj·b·ck] = 1.
Expressions like 'ek' represent possible outcomes of ck — i.e., 'ek' ranges over the members of Ok. As before, 'cn' denotes the conjunction of the first n test conditions, (c1·c2·. . .·cn), and 'en' represents possible sequences of corresponding outcomes, (e1·e2·. . .·en). We'll take 'En' to represent the set of all possible outcome sequences from conditions cn. So, for each hypothesis hj (including hi), we have ∑_{en∈En} P[en | hj·b·cn] = 1. There are no substantive assumptions in any of this — only notational conventions.
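These conventions are easy to mirror computationally. The sketch below uses toy likelihoods of my own choosing for a three-outcome experiment, and borrows the independence assumption introduced in the next subsection so that sequence likelihoods multiply; it simply checks the two normalization facts just stated:

import itertools, math

P_h = {"o1": 0.5, "o2": 0.3, "o3": 0.2}   # P[o_ku | h·b·c_k], toy values

# The outcomes in O_k are exhaustive for the hypothesis:
assert math.isclose(sum(P_h.values()), 1.0)

# E^n: all possible outcome sequences from n repetitions of the condition.
# Treating outcomes of distinct conditions as independent (section 5.2),
# sequence likelihoods multiply, and they too sum to 1 over E^n:
n = 4
total = sum(math.prod(P_h[o] for o in seq)
            for seq in itertools.product(P_h, repeat=n))
assert math.isclose(total, 1.0)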
5.2
About Probabilistic Independence
In almost all scientific contexts the outcomes in a series of experiments or observations are probabilistically independent of one another relative to each hypothesis under consideration. We may divide the kind of independence involved into two types.
Definition: Independent Evidence Conditions:
1. A sequence of outcomes ek is condition-independent of a condition for an additional experiment or observation ck+1, given h·b and its own conditions ck, if and only if P[ek | h·b·ck·ck+1] = P[ek | h·b·ck].
2. An individual outcome ek is result-independent of a sequence of other observations and their outcomes (ck−1·ek−1), given h·b and its own condition ck, if and only if P[ek | h·b·ck·(ck−1·ek−1)] = P[ek | h·b·ck].
When these two conditions hold, the likelihood for a sequence of experiments or observations may be decomposed into the product of the likelihoods for individual experiments or observations. To see how the two independence conditions affect the decomposition, first consider the following formula, which holds even if neither independence condition is satisfied:
(1) P[en | hj·b·cn] = ∏_{k=1}^{n} P[ek | hj·b·cn·ek−1].
When condition-independence holds, the likelihood of the whole evidence stream parses into a product of likelihoods that probabilistically depend on only past observation conditions and their outcomes. They do not depend on the conditions for other experiments whose outcomes are not yet specified. Here is the formula:
(2) P[en | hj·b·cn] = ∏_{k=1}^{n} P[ek | hj·b·ck·(ck−1·ek−1)].
Finally, whenever both independence conditions are satisfied we obtain the following relationship between the likelihood of the evidence stream and the likelihoods of individual experiments or observations:41
(3) P[en | hj·b·cn] = ∏_{k=1}^{n} P[ek | hj·b·ck].
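Equation (3) is what makes sequence likelihoods tractable in practice. A minimal sketch, with per-outcome likelihood values that are toys of my own choosing:

import math

# Per equation (3), with both independence conditions in force, the likelihood
# of an outcome sequence is the product of the individual outcome likelihoods.
lik_h_i = [0.6, 0.6, 0.4, 0.6, 0.6]    # P[e_k | h_i·b·c_k] for five outcomes
lik_h_j = [0.4, 0.4, 0.6, 0.4, 0.4]    # P[e_k | h_j·b·c_k] for the same outcomes

P_en_hi = math.prod(lik_h_i)           # P[e^n | h_i·b·c^n]
P_en_hj = math.prod(lik_h_j)           # P[e^n | h_j·b·c^n]
print(P_en_hj / P_en_hi)               # likelihood ratio ≈ 0.296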
In almost all scientific contexts both clauses of the Independent Evidence Condition will be satisfied. To see this, let us consider each independence condition more carefully. Condition-independence says that the mere addition of a new observation condition ck+1, without specifying one of its outcomes, does not alter the likelihood of the outcomes ek of other experiments ck. To appreciate the significance of this condition, imagine how the world would be if it were violated. Suppose hypothesis hj is some statistical theory, say, a quantum theory of superconductivity. The conditions expressed in ck describe a number of experimental setups, perhaps conducted in numerous labs, that test a variety of aspects of the theory (e.g., experiments that test electrical conductivity in different materials at a range of temperatures). Outcome sequence ek describes the results of these experiments. The violation of condition-independence would mean that merely adding to hj·b·ck a statement ck+1 describing the set-up of an additional experiment, but with no mention of its outcome, changes how likely the evidence sequence ek is: i.e., P[ek | h·b·ck·ck+1] ≠ P[ek | h·b·ck]. What (hj·b) says, via likelihoods, about the outcomes ek of experiments ck differs as a result of merely supplying a description of another experimental arrangement, ck+1. Condition-independence, when it holds, rules out such strange effects.
41 For derivations of equations (2) and (3) see [Hawthorne, 2004, supplement 3] at http://plato.stanford.edu/entries/logic-inductive/supplement3.html
Result-independence says that the description of previous test conditions together with their outcomes is irrelevant to the likelihoods of outcomes for additional experiments. If this condition were widely violated, then in order to specify the most informed likelihoods for a given hypothesis one would need to include information about volumes of past observations and their outcomes. What a hypothesis says about future cases would depend on how past cases have gone. Such dependence had better not happen on a large scale. Otherwise, the hypothesis would be fairly useless, since its empirical import in each specific case would depend on taking into account volumes of past observational and experimental results. However, even if such dependencies occur, provided they are not too pervasive, result-independence can be accommodated rather easily by packaging each collection of result-dependent data together, treating it like a single extended experiment or observation. The result-independence condition will then be satisfied by letting each term 'ck' in the statement of the independence condition represent a conjunction of test conditions for a collection of result-dependent tests, and by letting each term 'ek' (and each term 'oku') stand for a conjunction of the corresponding result-dependent outcomes. Thus, by packaging result-dependent data together in this way, the result-independence condition is satisfied by those (conjunctive) statements that describe the separate, result-independent chunks.42 The version of the Likelihood Ratio Convergence Theorem I'll present depends on the usual axioms of probability theory together with the Independent Evidence Conditions. It depends on no other assumptions (except those explicitly stated in the antecedent of the theorem itself). Thus, from this point on, let's suppose that the following two assumptions hold. Independent Evidence Assumptions: For each hypothesis h and background b under consideration, let's assume that the experiments and observations can be packaged into condition statements, c1, . . ., ck, ck+1, . . ., and possible outcomes in a way that satisfies the following independence conditions: 1. Each sequence of possible outcomes ek of a sequence of conditions ck is condition-independent of additional conditions ck+1 — i.e., P[ek | h·b·ck·ck+1] = P[ek | h·b·ck].
2. Each possible outcome ek of condition ck is result-independent of sequences of other observations and possible outcomes (ck−1·ek−1) — i.e., P[ek | h·b·ck·(ck−1·ek−1)] = P[ek | h·b·ck].
42 In scientific contexts the most prominent kind of case where data may fail to be result-independent is where some quantity of past data helps tie down the numerical value of a parameter not completely specified by the hypothesis at issue, where the value of this parameter influences the likelihoods for outcomes of lots of other experiments. Such hypotheses (with their free parameters) are effectively disjunctions of more specific hypotheses, where each distinct disjunct is a distinct version of the original hypothesis that has a specific value for the parameter filled in. Evidence that "fills in the value" for the parameter just amounts to evidence that refutes (via likelihood ratios) those alternative more specific, filled-in hypotheses that possess incorrect parameter values. For any specific, filled-in hypothesis, the evidence that bears on whether it has the correct parameter value will be independent of other evidence that relies on the parameter value. So, relative to each of these more specific hypotheses, result-independence holds.
We now have all that is needed to begin to state the Likelihood Ratio Convergence Theorem. The convergence theorem comes in two parts. The first part applies only to those experiments or observations that have possible outcomes, according to hi, that alternative hj says are impossible. The second part of the theorem applies to all other experiments or observations.
5.3
Likelihood Ratio Convergence under Conditions where Falsifying Outcomes are Possible
The first part of the Likelihood Ratio Convergence Theorem applies whenever some of the experiments or observations in sequence cn have possible outcomes with non-0 likelihoods on hypothesis hi, but 0 likelihoods on alternative hj. Such outcomes are highly desirable. If they occur, the likelihood ratio comparing hj to hi will be 0, and hj will be falsified. A crucial experiment is a special case of this, the case where, for at least one possible outcome oku, P[oku | hi·b·ck] = 1 and P[oku | hj·b·ck] = 0. In the more general case hi together with b says that one of the outcomes of ck is at least minimally probable, whereas hj says that outcome is impossible: P[oku | hi·b·ck] > 0 and P[oku | hj·b·ck] = 0. Likelihood Ratio Convergence Theorem Part 1: The Falsification Theorem:43 Suppose cm, a subsequence of the whole evidence sequence cn, consists of experiments or observations with the following property: there are outcomes oku of each ck in cm deemed impossible by hj·b but deemed possible by hi·b to at least some small degree δ. That is, suppose there is some δ > 0 such that for each ck in cm, P[∨{oku : P[oku | hj·b·ck] = 0} | hi·b·ck] ≥ δ. Then,
P[∨{en : P[en | hj·b·cn]/P[en | hi·b·cn] = 0} | hi·b·cn] = P[∨{en : P[en | hj·b·cn] = 0} | hi·b·cn] ≥ 1 − (1−δ)^m,
which approaches 1 for large m. In other words, suppose hi says observation ck has at least a small likelihood of producing one of the outcomes oku that hj says is impossible — that is, P[∨{oku : P[oku | hj·b·ck] = 0} | hi·b·ck] ≥ δ > 0. And suppose that some number m of experiments or observations are of this kind. If the number of such observations is large enough, and hi (together with b·cn) is true, then it is highly likely that one of the outcomes held to be impossible by hj will occur, and the likelihood ratio of hj over hi will then become 0. Bayes' Theorem then goes on to
43 For proof see [Hawthorne, 2004] http://plato.stanford.edu/entries/logic-inductive/supplement4.html.
imply that when this happens, hj is absolutely refuted — its posterior probability becomes 0. The Falsification Theorem is very commonsensical. First, notice that when there is a crucial experiment in the evidence stream, the theorem is completely obvious. That is, suppose for the specific experiment ck (in evidence stream cn) there are two incompatible possible outcomes okv and oku such that P[okv | hj·b·ck] = 1 and P[oku | hi·b·ck] = 1. Then, clearly, P[∨{oku : P[oku | hj·b·ck] = 0} | hi·b·ck] = 1, since oku is "one of the oku such that P[oku | hj·b·ck] = 0". So where there is a crucial experiment available, the theorem applies with m = 1 and δ = 1. The theorem is equally commonsensical when there is no crucial experiment. To see what it says in such cases, consider an example. Let hi be some theory that implies a specific rate of proton decay, but a rate so low that there is only a very small probability that any particular proton will decay in a given year. Consider an alternative theory hj that implies that protons never decay. If hi is true, then for a persistent enough sequence of observations (i.e., if proper detectors can be built and billions of protons kept under observation for long enough), eventually a proton decay will almost surely be detected. When this happens, the likelihood ratio becomes 0. Thus, the posterior probability of hj becomes 0. It may be instructive to plug some specific numbers into the formula given by the Falsification Theorem, to see what the convergence rate might look like. For example, the theorem tells us that if we compare any pair of hypotheses hi and hj on an evidence stream cn that contains at least m = 19 observations or experiments, each having δ ≥ .10 for the likelihood of yielding a falsifying outcome, then the likelihood (on hi·b·cn) of obtaining an outcome sequence en that will yield a likelihood ratio P[en | hj·b·cn]/P[en | hi·b·cn] = 0 must be at least 1 − (1−.1)^19 ≈ .865. A comment about the need for, and usefulness of, such convergence theorems is in order, now that we've seen one. Given some specific pair of scientific hypotheses hi and hj, one may always directly compute the likelihood, given (hi·b·cn), that any specific sequence of experiments or observations cn will result in one of the specific sequences of outcomes that yields low likelihood ratios. So, given a specific pair of hypotheses and a proposed sequence of experiments, we don't need a general Convergence Theorem to tell us the likelihood of obtaining refuting evidence. The specific hypotheses hi and hj tell us this themselves. Indeed, they tell us the likelihood of obtaining each specific outcome stream, including those that refute the competitor or produce a very small likelihood ratio for it. Furthermore, after we've actually performed an experiment and recorded its outcome, all that matters is the actual ratio of likelihoods for that outcome. Convergence theorems become moot. The point of the Likelihood Ratio Convergence Theorem (both the Falsification Theorem, and the other part of the theorem, still to come) is to assure us in advance of the consideration of any specific pair of hypotheses that if the possible evidence streams that test hypotheses have certain characteristics that reflect the empirical distinctness of the hypotheses, then it is highly likely that one of the sequences of outcomes will occur that results in a very small likelihood ratio.
These theorems provide relatively loose, finite lower bounds on how quickly such convergence is likely to occur. Thus, the only point of such convergence theorems is to assure us, in advance of our using the logic of confirmation to test specific hypotheses, that this logic is likely to do what we want it to do — i.e., to result in the refutation of empirically distinct false alternatives to the true hypothesis, and to generate a high degree of positive confirmation for the true hypothesis.
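The Falsification Theorem's bound is simple enough to tabulate directly. The following sketch (the packaging is mine) reproduces the m = 19, δ = .10 figure computed in the text and shows how quickly the bound climbs with m:

def falsification_bound(delta, m):
    # P[some h_j-falsifying outcome occurs | h_i·b·c^n] ≥ 1 − (1 − δ)^m
    return 1.0 - (1.0 - delta) ** m

print(falsification_bound(0.10, 19))          # ≈ 0.865, as in the text
for m in (10, 50, 100):
    print(m, falsification_bound(0.10, m))    # the bound climbs quickly with m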
5.4
Likelihood Ratio Convergence under Conditions where No Falsifying Outcomes are Possible
The Falsification Theorem shows what happens when the evidence stream includes possible outcomes that may falsify the alternative hypothesis. But what if no possibly falsifying outcomes are present? That is, what if hypothesis hj only specifies various non-zero likelihoods for possible outcomes? Or what if hj does specify 0 likelihoods for some outcomes, but only for those that hi says are impossible as well? Such evidence streams are undoubtedly much more common in practice than those containing possibly falsifying outcomes. To cover evidence streams of this kind we first need to identify a useful way to measure the degree to which hypotheses are empirically distinguishable by such evidence. Consider some particular sequence of outcomes en that results from observations cn. The likelihood ratio P[en | hj·b·cn]/P[en | hi·b·cn] measures the extent to which that outcome sequence distinguishes between hi and hj. But, as a measure of the power of evidence to distinguish among hypotheses, likelihood ratios themselves provide a rather lopsided scale, a scale that ranges from 0 to infinity with the midpoint, the point where en doesn't distinguish at all between hi and hj, at 1. So, rather than using raw likelihood ratios to measure the ability of en to distinguish between hypotheses, it proves more useful to employ a symmetric measure. The logarithm of the likelihood ratio provides just such a measure. Definition: QI — the Quality of the Information. For each experiment or observation ck, define the quality of the information provided by possible outcome oku for distinguishing hj from hi, given b, as follows (where we take the log to be base 2): QI[oku | hi/hj | b·ck] = log[P[oku | hi·b·ck]/P[oku | hj·b·ck]]. Similarly, define QI[en | hi/hj | b·cn] = log[P[en | hi·b·cn]/P[en | hj·b·cn]]. We measure the Quality of the Information an outcome would yield in distinguishing between two hypotheses as the base-2 logarithm of the likelihood ratio. This is clearly a measure of the outcome's evidential strength at distinguishing between the two hypotheses. By this measure, two hypotheses, hi and hj, assign the same likelihood value to a given outcome oku just in case QI[oku | hi/hj | b·ck] = 0. Furthermore, because the log is base 2, when P[oku | hi·b·ck]/P[oku | hj·b·ck] = 2^r, QI[oku | hi/hj | b·ck] = r, for any real number r. So, QI measures information on a
logarithmic scale that is symmetric about the natural no-information midpoint, 0. Positive information (r > 0) favors hi over hj and negative information (r < 0) favors hj over hi. Given the Independent Evidence Assumptions it is easy to see that relative to each hypothesis (with background), hi·b and hj·b, the QI for a sequence of outcomes is just the sum of the QIs of the individual outcomes in the sequence:
(4) QI[en | hi/hj | b·cn] = ∑_{k=1}^{n} QI[ek | hi/hj | b·ck].
QI only measures the amount by which each specific outcome counts for or against the two hypotheses. But what we want to know is something about how the experiment or observation as a whole tends to produce distinguishing outcomes. The expected value of QI turns out to be very helpful in this regard. The expected value of a quantity is gotten by first multiplying each of its possible values by its probability of occurring, and then summing up these products. Thus, the expected value of QI is given by the following formula: Definition: EQI — the Expected Quality of the Information. Let's call hj outcome-compatible with hi on evidence stream ck just when for each possible outcome sequence ek of ck, if P[ek | hi·b·ck] > 0, then P[ek | hj·b·ck] > 0. We also adopt the convention that if P[oku | hj·b·ck] = 0, then the term QI[oku | hi/hj | b·ck]·P[oku | hi·b·ck] = 0, since the outcome oku has 0 probability of occurring given hi·b·ck. For hj outcome-compatible with hi on ck, define
EQI[ck | hi/hj | b] = ∑_u QI[oku | hi/hj | b·ck] × P[oku | hi·b·ck].
Also, define
EQI[cn | hi/hj | b] = ∑_{en∈En} QI[en | hi/hj | b·cn] × P[en | hi·b·cn].
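A small computational sketch may help fix these definitions. The outcome labels and likelihood values below are toys of my own choosing, with hj outcome-compatible with hi:

import math

def qi(p_i, p_j):
    # QI[o_ku | h_i/h_j | b·c_k] = log2( P[o_ku|h_i·b·c_k] / P[o_ku|h_j·b·c_k] )
    return math.log2(p_i / p_j)

def eqi(lik_i, lik_j):
    # EQI[c_k | h_i/h_j | b]: expectation of QI taken with h_i's likelihoods,
    # with the convention that outcomes of 0 probability on h_i contribute 0.
    return sum(qi(lik_i[o], lik_j[o]) * lik_i[o] for o in lik_i if lik_i[o] > 0)

lik_i = {"o1": 0.5, "o2": 0.3, "o3": 0.2}   # P[o_ku | h_i·b·c_k]  (toy values)
lik_j = {"o1": 0.2, "o2": 0.3, "o3": 0.5}   # P[o_ku | h_j·b·c_k]

print(eqi(lik_i, lik_j))    # ≈ 0.397 > 0: the hypotheses differ on o1 and o3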
Notice that whenever hj is not outcome-compatible with hi on evidence stream cm, Part 1 of the Likelihood Ratio Convergence Theorem, the Falsification Theorem, applies. The EQI of an experiment or observation is the Expected Quality of its Information for distinguishing hi from hj when hi is true. It is a measure of the expected evidential strength of the possible outcomes of an experiment or observation at distinguishing between the hypotheses. Whereas QI measures the ability of each particular outcome or sequence of outcomes to empirically distinguish hypotheses, EQI measures the tendency of experiments or observations to produce distinguishing outcomes. EQI tracks empirical distinctness in a very precise way, as we'll see in a moment. The EQI for a sequence of observations cn turns out to be just the sum of the EQIs of the individual observations ck in the sequence:44
44 For a derivation see [Hawthorne, 2004] http://plato.stanford.edu/entries/logic-inductive/supplement5.html.
(5) EQI[cn | hi/hj | b] = ∑_{k=1}^{n} EQI[ck | hi/hj | b].
This suggests that it may be useful to average the values of the EQI[ck | hi/hj | b] over the number of observations n. We then obtain a measure of the average expected quality of the information from the experiments and observations that make up cn. Definition: \overline{EQI} — the Average Expected Quality of Information. The average expected quality of information, \overline{EQI}, from cn for distinguishing hj from hi, given hi·b, is defined as: \overline{EQI}[cn | hi/hj | b] = EQI[cn | hi/hj | b] / n. This definition together with equation (5) yields the following:
(6) \overline{EQI}[cn | hi/hj | b] = (1/n) × ∑_{k=1}^{n} EQI[ck | hi/hj | b].
It turns out that the value of EQI[ck | hi/hj | b] cannot be less than 0; and it will be greater than 0 just in case hi is empirically distinct from hj on at least one outcome oku — i.e., just in case for at least one oku, P[oku | hi·b·ck] ≠ P[oku | hj·b·ck]. The same goes for the average, \overline{EQI}[cn | hi/hj | b]. Theorem: Nonnegativity of EQI.45 EQI[ck | hi/hj | b] ≥ 0; and EQI[ck | hi/hj | b] > 0 if and only if for at least one of its possible outcomes oku, P[oku | hi·b·ck] ≠ P[oku | hj·b·ck].
Also, \overline{EQI}[cn | hi/hj | b] ≥ 0; and \overline{EQI}[cn | hi/hj | b] > 0 if and only if at least one experiment or observation ck has at least one possible outcome oku such that P[oku | hi·b·ck] ≠ P[oku | hj·b·ck].
In fact it can be shown that increasing the fineness of the partition of the outcome space Ok = {ok1, . . ., okv, . . ., okw} by breaking it up into more distinct outcomes (if it can be so divided) always results in a larger value for EQI, provided that at least some of the additional outcomes have distinct likelihood ratio values.46 EQI tracks empirical distinctness in a very precise way. The importance of the Nonnegativity of EQI result for the Likelihood Ratio Convergence Theorem will become apparent in a moment. We are now in a position to state the second part of the Likelihood Ratio Convergence Theorem. It applies to all evidence streams that do not contain possibly falsifying outcomes for hj when hi holds — i.e., it applies to all evidence streams for which hj is outcome-compatible with hi on each ck in the stream.
45 For proof see [Hawthorne, 2004] http://plato.stanford.edu/entries/logic-inductive/supplement6.html
46 See [Hawthorne, 2004] http://plato.stanford.edu/entries/logic-inductive/supplement6.html
Likelihood Ratio Convergence Theorem Part 2: The Probabilistic Refutation Theorem.47 Let γ > 0 be any number smaller than 1/e² (≈ .135; where this 'e' is the base of the natural logarithm). And suppose that for each possible outcome oku of each observation condition ck in cn, either P[oku | hi·b·ck] = 0 or P[oku | hj·b·ck]/P[oku | hi·b·ck] ≥ γ. Choose any positive ε < 1, as near to 0 as you like, but large enough that (for the number of observations n being contemplated) the value of \overline{EQI}[cn | hi/hj | b] > −(log ε)/n. Then
P[∨{en : P[en | hj·b·cn]/P[en | hi·b·cn] < ε} | hi·b·cn] > 1 − (1/n) × (log γ)² / (\overline{EQI}[cn | hi/hj | b] + (log ε)/n)²
which approaches 1 for large n when \overline{EQI}[cn | hi/hj | b] has a positive lower bound — i.e., when the sequence of observations cn has an average expected quality of information (for empirically distinguishing hj from hi) that doesn't diminish towards 0 as the evidence sequence increases. This theorem provides a very reasonable sufficient condition for the likely refutation of false alternatives via exceedingly small likelihood ratios. The condition under which this happens draws only on a characterization of the degree to which the hypotheses involved are empirically distinct from each other. The theorem says that when these conditions of empirical distinctness are met, hypothesis hi (together with b·cn) provides a likelihood that is at least within (1/n) × (log γ)² / (\overline{EQI}[cn | hi/hj | b] + (log ε)/n)² of 1 that some outcome sequence en will occur that yields a likelihood ratio smaller than the chosen ε. It turns out that in almost every case the actual likelihood of obtaining such evidence will be much closer to 1 than this factor indicates. Thus, this theorem provides a rather loose lower bound on the likelihood of obtaining small likelihood ratios. It shows that the larger the value of \overline{EQI} for an evidence stream, the more likely it is that the stream will produce a sequence of outcomes that yield very small likelihood ratios. But even if \overline{EQI} remains quite small, a long enough stream, n, will almost surely do the trick.48
47 For a proof see [Hawthorne, 2004] http://plato.stanford.edu/entries/logic-inductive/supplement7.html
48 It should now be clear why the boundedness of \overline{EQI} above 0 is important. Convergence Theorem 2 applies only when \overline{EQI}[cn | hi/hj | b] > −(log ε)/n. But this requirement is not a strong assumption. For, the Nonnegativity of EQI Theorem shows that the empirical distinctness of two hypotheses on a single possible outcome suffices to make the average EQI positive for the whole sequence of experiments. So, given any small fraction ε > 0, the value of −(log ε)/n (which is greater than 0) will eventually become smaller than \overline{EQI}, provided that the degree to which the hypotheses are empirically distinct for the various observations ck does not on average degrade too much as the length n of the evidence stream increases. This seems a reasonable condition on the empirical distinctness of hypotheses.
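To get a feel for the Part 2 bound, here is a small sketch with toy inputs entirely of my own choosing; logs are taken base 2 to match the convention used for QI:

import math

def refutation_bound(avg_eqi, epsilon, gamma, n):
    # 1 − (1/n) × (log γ)² / (avg_EQI + (log ε)/n)², per the theorem.
    assert 0 < gamma < 1 / math.e ** 2 and 0 < epsilon < 1
    assert avg_eqi > -math.log2(epsilon) / n     # the theorem's antecedent
    denom = (avg_eqi + math.log2(epsilon) / n) ** 2
    return 1 - (1 / n) * (math.log2(gamma) ** 2) / denom

# Toy inputs: average EQI of 0.4 bits per observation, cutoff ε = 10^−3,
# ratio floor γ = .05 (< 1/e² ≈ .135), and n = 400 observations.
print(refutation_bound(0.4, 1e-3, 0.05, 400))    # ≈ 0.668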
Notice that the antecedent condition of the theorem, that "either P[oku | hi·b·ck] = 0 or P[oku | hj·b·ck]/P[oku | hi·b·ck] ≥ γ, for some γ > 0 but γ less than 1/e² (≈ .135)", does not favor hypothesis hi in any way. This condition only rules out the possibility that some outcomes might furnish extremely strong evidence against hj relative to hi. This condition is only needed because our measure of the evidential distinguishability of pairs of hypotheses, QI, blows up when the likelihood ratio P[oku | hj·b·ck]/P[oku | hi·b·ck] is extremely small. Furthermore, this condition is really no restriction at all on the application of the theorem to possible experiments or observations. If ck has some possible outcome description oku that would make P[oku | hj·b·ck]/P[oku | hi·b·ck] < γ (for a given small γ of interest), one may disjunctively lump oku together with some other outcome description okv for ck. Then the antecedent condition of the theorem will be satisfied, but with the sentence '(oku ∨ okv)' treated as a single outcome in the formula for EQI. It can be proved that the only effect of such "disjunctive lumping" is to make EQI a bit smaller than it would otherwise be. If, when the evidence is actually collected, such a "too refuting" outcome oku actually occurs, so much the better. We merely failed to take this possibility for refutation into account in computing our lower bound on the likelihood that refutation via likelihood ratios will occur. The point of the two Convergence Theorems explored in this section is to assure us, in advance of considering any specific pair of hypotheses, that if the possible evidence streams that test pairs of hypotheses against each other have certain characteristics which reflect their evidential distinguishability, it is highly likely that outcomes yielding small likelihood ratios will result. These theorems provide finite lower bounds on how quickly convergence is likely to occur, bounds that show one need not wait for convergence through some infinitely long run. Indeed, for any evidence sequence in which the probability distributions are at all well behaved, the actual likelihood of obtaining outcomes that yield small likelihood ratio values will inevitably be much higher than the lower bounds given by Parts 1 and 2 of the Theorem. In sum, according to the Theorem, each hypothesis hi says, via likelihoods, the following: "given enough observations, I am very likely to dominate my empirically distinct rivals in a contest of likelihood ratios." Even a sequence of observations with an extremely low average expected quality of information is very likely to do the job, provided that the sequence is long enough. Presumably, in saying this, the true hypothesis speaks truthfully, and its false competitors lie. Thus (by Equation 9), as evidence accumulates, the degree of confirmation for false hypotheses will very probably approach 0, which will indicate that they are probably false; and as this happens (by Equations 10 and 11), the degree of confirmation of the true hypothesis will approach 1, indicating its probable truth.
6
WHEN THE LIKELIHOODS ARE VAGUE AND/OR DIVERSE
Up to this point I’ve been supposing that likelihoods possess objective or agreed numerical values. Although this supposition is often satisfied in scientific contexts,
there are important settings where it is unrealistic, where individuals are pretty vague about the numerical values of likelihoods, even though the evidence seems to weigh strongly against one hypothesis and in support of another. So let's see how the supposition of precise, agreed values for likelihoods may be relaxed in a reasonable way. Let's first consider an example of evidence for an important scientific hypothesis where the likelihoods are vague. Consider the following drift hypothesis: the land masses of Africa and South America were once joined together, then split and have drifted apart over the eons. Let's compare it to an alternative contractionist hypothesis: the continents have fixed positions acquired when the earth first formed, cooled and contracted into its present configuration. On each of these hypotheses, how likely is it that: (1) the shape of the east coast of South America should match the shape of the west coast of Africa as closely as it in fact does; (2) the geology of the two coasts should match up so well; (3) the plant and animal species on these distant continents should be as similar as they are? One may not be able to determine anything like precise numerical values for such likelihoods. But experts readily agree that each of these observations is much more likely on the drift hypothesis than on the contractionist hypothesis. Jointly these observations constitute very strong evidence in favor of drift over its contraction alternative. On a Bayesian analysis this is due to the fact that experts in the scientific community widely agree (at least implicitly) that the ratio of the likelihoods strongly favors drift over contraction. As equations 9-11 show, this suffices to strongly refute the contractionist hypothesis with respect to the drift hypothesis (unless the contractionist hypothesis is taken to be quite a bit more plausible than the drift hypothesis on other grounds).49
49 Historically the case for continental drift is somewhat more complicated. Geologists tended to largely dismiss the evidence referred to above until the 1960s. Although this evidence may seem to be quite strong, it was unconvincing because it was not sufficiently strong to overcome certain non-evidential plausibility considerations that made the drift hypothesis seem extremely implausible — much less plausible than the more traditional contraction view. The chief problem was that there appeared to be no plausible mechanism by which drift might occur. It was argued that no known force or mechanism could push or pull the continents apart, and that the less dense continental material cannot possibly push through the denser material that makes up the ocean floor. These objections were eventually overcome when a plausible mechanism was articulated — i.e. that the continental crust floats atop molten material and moves apart as convection currents in the molten material carry it along. The case was pretty well clinched when evidence for this mechanism was found in the form of spreading zones containing alternating strips of magnetized material at regular distances from mid-ocean ridges. The magnetic alignments of materials in these strips correspond closely to the magnetic alignments found in magnetic materials in dateable sedimentary layers at other locations on the earth. These magnetic alignments indicate time periods when the direction of earth's magnetic field has reversed.
This gave geologists a way of measuring the rate at which the sea floor might spread, and the continents move apart. Although geologists may not be able to determine anything like precise values for the likelihoods of any of this evidence on each of the alternative hypotheses, the evidence is universally agreed to be much more likely on the drift hypothesis than on the contractionist alternative. Also, with the emergence of a plausible mechanism, the drift hypothesis no longer seems so overwhelmingly implausible due to non-evidential considerations. Thus, the weight of a likelihood ratio may be objective or public enough to strongly support a hypothesis over an alternative even in cases where precise values for likelihoods cannot be determined.
Recall now the reasons given earlier for the desirability of agreement or near agreement on values for likelihoods in scientific contexts. I argued that to the extent that members of a scientific community disagree on the values of the likelihoods, they disagree about the empirical content of their hypotheses — about what each hypothesis says the world is like. Such disagreement about empirical import may result in widely disparate assessments regarding which hypotheses are favored or disfavored by a given body of evidence. Similarly, to the extent that the values of likelihoods are vague for an individual agent, he or she may be unable to determine which of several hypotheses is favored or disfavored by a given body of evidence. Notice, however, that on a Bayesian account of confirmation the values of individual likelihoods are not really crucial to the way evidence sorts among hypotheses. Rather (as Equations 9-11 show), it is ratios of likelihoods that do the heavy lifting. So, even if two confirmation functions Pα and Pβ disagree on the values of likelihoods, they may, nevertheless, largely agree on the refutation or support that accrues to various rival hypotheses, provided that the following Directional Agreement Condition is satisfied:

Directional Agreement Condition: The likelihood ratios due to each of a pair of confirmation functions Pα and Pβ will be said to agree in direction (with respect to the possible outcomes of experiments or observations relevant to a pair of hypotheses) just in case each of the following conditions holds:

• for each possible outcome ek of experiments and observations ck in the evidence stream, Pα[ek | hj·b·ck]/Pα[ek | hi·b·ck] < 1 just in case Pβ[ek | hj·b·ck]/Pβ[ek | hi·b·ck] < 1, and Pα[ek | hj·b·ck]/Pα[ek | hi·b·ck] > 1 just in case Pβ[ek | hj·b·ck]/Pβ[ek | hi·b·ck] > 1; and
• each of these likelihood ratios is close to 1 for both confirmation functions or for neither.
When this condition holds, the evidence will support hi over hj according to Pα just in case it does so according to Pβ, although the strength of support may differ somewhat. Furthermore, although the rate at which the likelihood ratios increase or decrease as evidence accumulates may differ for these confirmation functions, the total impact of the cumulative evidence will ultimately affect the refutation and support of hypotheses in much the same way for each function. Thus, when likelihoods are vague or diverse, we may take the approach we employed for vague and diverse prior plausibility assessments. We may represent the vagueness in an agent's assessments of both prior plausibilities and likelihoods in terms of a vagueness set — a set of confirmation functions that covers the range of values that are acceptable to the agent. Similarly, we may extend the diversity sets for communities of agents to include confirmation functions for both the
range of likelihoods and the range of prior plausibilities (from individual vagueness sets) that represent the considered views of the members of the relevant scientific community. The Likelihood Ratio Convergence Theorem can still do its work in this context, provided that the Directional Agreement Condition is satisfied by all confirmation functions in these extended vagueness and diversity sets. The proof of the theorem doesn't depend on supposing that likelihoods are objective or have intersubjectively agreed values. The theorem may be applied to each confirmation function Pα individually. The only real difficulty that arises from applying the theorem to a range of confirmation functions that disagree on the values of likelihoods is that a specific outcome sequence that strongly favors hi according to Pα may instead strongly favor hj according to Pβ. However, when the Directional Agreement Condition holds for a family of confirmation functions, this kind of confirmational bifurcation cannot happen. Directional Agreement means that the empirical import of hypotheses as represented by Pα and Pβ is similar enough that each evidence sequence must favor the same hypotheses for both functions. Thus, when the Directional Agreement Condition holds, if enough empirically distinguishing experiments or observations are forthcoming, all support functions in an extended vagueness or diversity set will very probably come to agree in assigning extremely small likelihood ratios to empirically distinct false competitors of a true hypothesis. As that happens, the community comes to agree on the refutation of these competitors, and the true hypothesis rises to the top of the heap.50 What if the true hypothesis has empirically equivalent rivals? Then their posterior probabilities must rise as well. The Likelihood Ratio Convergence Theorem only assures us that the disjunction of the true hypothesis with its empirically equivalent rivals will be driven to 1 as evidence lays low the empirically distinct rivals. The true hypothesis may itself approach 1 only if either it has no empirically equivalent rivals, or whatever equivalent rivals it has are also laid low by non-evidential plausibility considerations.

50 Even if some part of the evidence gives rise to directionally disagreeing likelihood ratios (e.g., due to minor disagreements about empirical import), these may not interfere too much with agreement about evidential support for hypotheses, provided that the most substantial part of the evidence gives rise to overwhelmingly powerful directionally agreeing likelihood ratios.
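To make the condition concrete, here is a minimal computational sketch (mine, not Hawthorne's, with all likelihood ratios invented purely for illustration) of how directional agreement between two confirmation functions might be checked outcome by outcome:

    # Sketch: checking the Directional Agreement Condition for two
    # hypothetical confirmation functions P_alpha and P_beta. Each list
    # holds the ratios P[ek | hj.b.ck] / P[ek | hi.b.ck] for a stream
    # of possible outcomes; the numbers are invented for illustration.
    ratios_alpha = [0.2, 0.5, 3.0, 10.0]
    ratios_beta = [0.3, 0.4, 2.0, 25.0]

    def agree_in_direction(r_a, r_b, tolerance=0.05):
        """Same side of 1, and close to 1 for both functions or for neither."""
        same_side = (r_a < 1) == (r_b < 1) and (r_a > 1) == (r_b > 1)
        return same_side and ((abs(r_a - 1) < tolerance) == (abs(r_b - 1) < tolerance))

    print(all(agree_in_direction(a, b)
              for a, b in zip(ratios_alpha, ratios_beta)))  # True for these values

When the check succeeds for every possible outcome, the two functions may disagree on the strength of support an outcome confers, but never on its direction.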
ACKNOWLEDGEMENTS Thanks to Prasanta Bandyopadhyay, Mark Gutel, Mary Gwin, Adam Morton, and an anonymous referee for their helpful comments on various drafts of this article.
BIBLIOGRAPHY

[Carnap, 1950] R. Carnap. Logical Foundations of Probability, Chicago: University of Chicago Press, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods, Chicago: University of Chicago Press, 1952.
[Carnap, 1971] R. Carnap. A Basic System of Inductive Logic, Part I, in R. Carnap and R. C. Jeffrey (eds.), Studies in Inductive Logic and Probability, Vol. 1, Berkeley: University of California Press, 1971.
[Carnap, 1980] R. Carnap. A Basic System of Inductive Logic, Part II, in R. C. Jeffrey (ed.), Studies in Inductive Logic and Probability, Vol. 2, Berkeley: University of California Press, 1980.
[De Finetti, 1937] B. De Finetti. La Prévision: Ses Lois Logiques, Ses Sources Subjectives, Annales de l'Institut Henri Poincaré, 7, 1-68; translated as "Foresight. Its Logical Laws, Its Subjective Sources", in H. E. Kyburg, Jr. and H. E. Smokler (eds.), 1980, Studies in Subjective Probability, Robert E. Krieger Publishing.
[Duhem, 1906] P. Duhem. La théorie physique. Son objet et sa structure, Paris: Chevalier et Rivière, 1906; translated by P. P. Wiener, 1954, The Aim and Structure of Physical Theory, Princeton, NJ: Princeton University Press.
[Earman, 1992] J. Earman. Bayes or Bust?, Cambridge, MA: MIT Press, 1992.
[Edwards, 1972] A. W. F. Edwards. Likelihood, Cambridge: Cambridge University Press, 1972.
[Eells, 1985] E. Eells. Problems of Old Evidence, Pacific Philosophical Quarterly, 66, 283-302, 1985.
[Fisher, 1922] R. A. Fisher. On the Mathematical Foundations of Theoretical Statistics, Philosophical Transactions of the Royal Society, series A, 309-368, 1922.
[Fitelson, 1999] B. Fitelson. The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity, Philosophy of Science, 66, S362-S378, 1999.
[Fitelson, 2005] B. Fitelson. Inductive Logic, in J. Pfeifer and S. Sarkar (eds.), The Philosophy of Science: An Encyclopedia, Oxford: Routledge, 2005.
[Goodman, 1955] N. Goodman. The New Riddle of Induction, Chapter 3 of Fact, Fiction, and Forecast, Cambridge, Mass.: Harvard University Press, 1955.
[Gaifman, 1964] H. Gaifman. Concerning Measures in First Order Calculi, Israel Journal of Mathematics, 2, 1-18, 1964.
[Gaifman and Snir, 1982] H. Gaifman and M. Snir. Probabilities Over Rich Languages, Journal of Symbolic Logic, 47, 495-548, 1982.
[Glymour, 1980] C. Glymour. Theory and Evidence, Princeton: Princeton University Press, 1980.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference, Cambridge: Cambridge University Press, 1965.
[Hájek, 2003a] A. Hájek. What Conditional Probability Could Not Be, Synthese, 137, 273-323, 2003.
[Hájek, 2003b] A. Hájek. Interpretations of the Probability Calculus, in The Stanford Encyclopedia of Philosophy (Summer 2003 Edition), Edward N. Zalta (ed.), 2003. http://plato.stanford.edu/archives/sum2003/entries/probability-interpret/.
[Hájek, 2005] A. Hájek. Scotching Dutch Books?, Philosophical Perspectives, 19 (Epistemology), 139-151, 2005.
[Hawthorne, 2004] J. Hawthorne. Inductive Logic, in The Stanford Encyclopedia of Philosophy (Winter 2006 Edition), Edward N. Zalta (ed.), 2004. http://plato.stanford.edu/archives/win2006/entries/logic-inductive/.
[Hawthorne, 2005] J. Hawthorne. Degree-of-Belief and Degree-of-Support: Why Bayesians Need Both Notions, Mind, 114, 277-320, 2005.
[Howson, 1997] C. Howson. A Logic of Induction, Philosophy of Science, 64, 268-290, 1997.
[Howson and Urbach, 1993] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach, 2nd edition, Chicago: Open Court, 1993.
[Jaynes, 1968] E. T. Jaynes. Prior Probabilities, IEEE Transactions on Systems Science and Cybernetics, SSC-4, 227-241, 1968.
[Jeffrey, 1965] R. C. Jeffrey. The Logic of Decision, McGraw-Hill, 1965; 2nd edition, Chicago: University of Chicago Press, 1983.
[Jeffrey, 1987] R. C. Jeffrey. Alias Smith and Jones: The Testimony of the Senses, Erkenntnis, 26, 391-399, 1987.
[Jeffrey, 1992] R. C. Jeffrey. Probability and the Art of Judgment, New York: Cambridge University Press, 1992.
[Jeffreys, 1939] H. Jeffreys. Theory of Probability, Oxford: Oxford University Press, 1939.
[Joyce, 1999] J. Joyce. The Foundations of Causal Decision Theory, New York: Cambridge University Press, 1999.
[Joyce, 2003] J. Joyce. Bayes' Theorem, in The Stanford Encyclopedia of Philosophy (Summer 2003 Edition), Edward N. Zalta (ed.), 2003. http://plato.stanford.edu/archives/win2003/entries/bayes-theorem/.
[Joyce, 2004] J. Joyce. Bayesianism, in A. Mele and P. Rawling (eds.), The Oxford Handbook of Rationality, Oxford: Oxford University Press, 2004.
[Joyce, 2005] J. Joyce. How Degrees of Belief Reflect Evidence, Philosophical Perspectives, 19, 153-179, 2005.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability, Macmillan and Co., 1921.
[Kyburg, 1977] H. Kyburg. Randomness and the Right Reference Class, Journal of Philosophy, 74, 501-520, 1977.
[Lange, 1999] M. Lange. Calibration and the Epistemological Role of Bayesian Conditionalization, Journal of Philosophy, 96, 294-324, 1999.
[Levi, 1967] I. Levi. Gambling with Truth, New York: Knopf, 1967.
[Levi, 1977] I. Levi. Direct Inference, Journal of Philosophy, 74, 5-29, 1977.
[Levi, 1978] I. Levi. Confirmational Conditionalization, Journal of Philosophy, 75, 730-737, 1978.
[Levi, 1980] I. Levi. The Enterprise of Knowledge, Cambridge, Mass.: MIT Press, 1980.
[Lewis, 1980] D. Lewis. A Subjectivist's Guide to Objective Chance, in R. C. Jeffrey (ed.), Studies in Inductive Logic and Probability, Vol. 2, Berkeley: University of California Press, 263-293, 1980.
[Maher, 1996] P. Maher. Subjective and Objective Confirmation, Philosophy of Science, 63, 149-174, 1996.
[Maher, 2006] P. Maher. The Concept of Inductive Probability, Erkenntnis, 65, 185-206, 2006.
[Quine, 1953] W. V. Quine. Two Dogmas of Empiricism, in From a Logical Point of View, New York: Harper Torchbooks, 1953.
[Ramsey, 1926] F. P. Ramsey. Truth and Probability, in R. B. Braithwaite (ed.), The Foundations of Mathematics and Other Essays, Routledge & Kegan Paul, 1931, 156-198; reprinted in H. E. Kyburg, Jr. and H. E. Smokler (eds.), Studies in Subjective Probability, 2nd ed., R. E. Krieger Publishing Company, 1980, 23-52; reprinted in D. H. Mellor (ed.), Philosophical Papers, Cambridge: Cambridge University Press, 1990.
[Rosenkrantz, 1981] R. Rosenkrantz. Foundations and Applications of Inductive Probability, Atascadero, CA: Ridgeview Publishing, 1981.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm, New York: Chapman & Hall/CRC, 1997.
[Savage, 1954] L. J. Savage. The Foundations of Statistics, John Wiley, 1954; 2nd ed., New York: Dover, 1972.
[Savage et al., 1962] L. J. Savage et al. The Foundations of Statistical Inference, London: Methuen, 1962.
[Scott and Krauss, 1966] D. Scott and P. Krauss. Assigning Probabilities to Logical Formulas, in J. Hintikka and P. Suppes (eds.), Aspects of Inductive Logic, Amsterdam: North Holland, 219-264, 1966.
[Skyrms, 1984] B. Skyrms. Pragmatics and Empiricism, New Haven: Yale University Press, 1984.
[Skyrms, 1986] B. Skyrms. Choice and Chance, 3rd ed., Belmont, CA: Wadsworth, 1986.
[Williamson, 2007] J. Williamson. Inductive Influence, British Journal for the Philosophy of Science, 58, 689-708, 2007.
CHALLENGES TO BAYESIAN CONFIRMATION THEORY

John D. Norton
1 INTRODUCTION
Proponents of Bayesian confirmation theory believe that they have the solution to a significant, recalcitrant problem in philosophy of science. It is the identification of the logic that governs evidence and its inductive bearing in science. That is the logic that lets us say that our catalog of planetary observations strongly confirms Copernicus' heliocentric hypothesis; or that the fossil record is good evidence for the theory of evolution; or that the 3°K cosmic background radiation supports big bang cosmology. The definitive solution to this problem would be a significant achievement. The problem is of central importance to philosophy of science, for, in the end, what distinguishes science from myth making is that we have good evidence for the content of science, or at least of mature sciences, whereas myths are evidentially ungrounded fictions. The core ideas shared by all versions of Bayesian confirmation theory are, at a good first approximation, that a scientist's beliefs are or should conform to a probability measure; and that the incorporation of new evidence is through conditionalization using Bayes' theorem. While the burden of this chapter will be to inventory why critics believe this theory may not be the solution after all, it is worthwhile first to summarize here the most appealing virtues of this simple account. There are three. First, the theory reduces the often nebulous notion of a logic of induction to a single, unambiguous calculus, the probability calculus. Second, the theory has proven to be spacious, with a remarkable ability to absorb, systematize and vindicate what elsewhere appear as independent evidential truisms. Third is its most important virtue, an assurance of consistency. The larger our compass, the more we must digest evidence of diverse form and we must do it consistently. Most accounts of evidence provide no assurance of consistency in their treatment of larger bodies of evidence.1 Bayesian confirmation theory affords us a simple picture: the entire bearing of evidence at any moment is captured by a probability distribution. No matter how large a body of evidence we contemplate,

1 Some even fail in simple cases. Consider an enumerative induction on the white swans of Europe, which lets us conclude that all swans are white; and an enumerative induction on the black swans of Western Australia, which leads us to the contradictory conclusion that all swans are black.
we will not be led to contradictions in our evidential judgments as long as we form and update our beliefs in conformity with the probability calculus. Perhaps because of these virtues, Bayesian confirmation theory now enjoys the status of the leading account of induction and confirmation in the philosophy of science literature. Those who expect this circumstance to persist should recall that the present success of Bayesianism is relatively momentary. The probability calculus was born over 350 years ago in the seventeenth century in correspondence between Pascal and Fermat. One hundred years later, in the eighteenth century, the Reverend Thomas Bayes published his theorem as part of a proposal that probability theory be used to answer Hume's inductive skepticism. Nonetheless, the idea that the probability calculus afforded the right way to understand inductive inference remained a minority view. The dominant view of induction in the nineteenth century followed in the tradition of Bacon and his tables, with its most influential expression in John Stuart Mill's System of Logic (1891). This dominance persisted through to the mid twentieth century.2 It competed with the hypothetico-deductive approach to inductive inference. That view's venerable history can be traced through Descartes' method of hypothesis to ancient Athens, where astronomers were, as legend has it, asked by Plato to find the geometrical planetary constructions that would "save the phenomena" of astronomy [Duhem, 1969]. As recently as a few decades ago, the philosophy of science literature was filled with proposals for dealing with Hempel's [1945] notorious "paradox of the raven." That paradox arose within a development of Hempel's "satisfaction" criterion of confirmation, which was itself merely the ancient notion of enumerative induction transported from the context of Aristotle's syllogistic logic to modern, first order predicate logic. These differing approaches played out against persistent claims by Popper [1959] and his followers that the very notion of a logic of induction is mistaken. The Bayesian approach to induction and confirmation rose to dominance in philosophy of science only slowly in the course of the second half of the twentieth century,3 with its importance codified by such works as Howson and Urbach [2006]. Given this history of competing traditions in induction and confirmation rising and falling over the centuries, it seems only prudent to expect that the Bayesian approach will recede from its present prominence and once again be merely one of several useful instruments for assessing inductive inference relations. The goal of this chapter is to review the weaknesses recounted in the literature that may drive this decline. It will draw on two sources. First are those outside the Bayesian fold who present direct challenges; second are shortcomings identified by Bayesians, who in turn have sought modifications within the Bayesian system. These two
2 Mill's methods remain to this day the single most important methodological idea for computer repair people and auto mechanics seeking to diagnose problems. Their reach persists in the methodological literature at least to Skyrms [1975], whose Ch. IV gives a detailed development. See the historical survey of Blake, Ducasse and Madden [1960] for a broader sense of the minor role probability theory has played in the longer term history of ideas of scientific methodology.

3 Glymour [1980, p. 64] identifies the influence of Carnap [1950] as decisive.
literatures overlap sufficiently for it to be impractical for me to disentangle them.4 The survey will proceed from larger to smaller issues. Section 2 will review global challenges to Bayesian confirmation theory. These challenges derive from differences in the basic conception of inductive inference and the principles that govern it. They include the view that notions of inductive inference are merely artifacts subservient to whatever methods may reliably get us to the truth; and the proposal that there is no universally valid logic of induction, but only localized logics adapted to individual domains. Sections 3, 4 and 5 are based on a decomposition of Bayesian confirmation theory into independent components, which are presented as intuitive notions pertaining to confirmation. Additivity amounts to a decision that degrees of probability span belief and disbelief, not belief and ignorance; the essence of Bayesian dynamics is glossed as the incorporation of new evidence by a simple rule of "refute and rescale" prior belief. This decomposition supports a catalog of challenges to the specific Bayesian commitments. Among them is "the problem of the priors," which proves to be multiple problems aggregated under one label. One — here called the first problem — is that additivity precludes prior probabilities that represent a state of complete ignorance. The second problem derives from the simplicity of refute and rescale dynamics. It precludes specification of a neutral prior probability distribution that exerts little influence on the future course of conditionalization. Section 6 reports some further challenges and Section 7 is a brief conclusion. Some topics lie largely outside the compass of this chapter. In statistical practice, as opposed to philosophy of science, an epochal battle continues to be waged between proponents of traditional Neyman-Pearson hypothesis testing and Bayesian approaches. That debate has been of far less importance in philosophy of science and will be addressed by Deborah Mayo and Aris Spanos (this volume).5 Within the Bayesian literature itself, the dominant problem has been to find the correct interpretation of probability: subjective, objective or logical? To someone who is at a slight distance from Bayesianism, the debate seems misdirected in setting standards of precision for the interpretation of a central term of a theory that are rarely achieved elsewhere and are probably unachievable for probability. Its energy is reminiscent of an internecine feud among sect members who agree vastly more than they differ, but nonetheless tolerate no doctrinal deviations. This topic will be addressed only in so far as it arises in broader challenges to the Bayesian approach. These debates are addressed by Sandy Zabell (this volume) and Philip Dawid (this volume) as well as survey volumes [Galavotti, 2005; Gillies, 2000; Mellor, 2005].
4 Hence I apologize in advance to Bayesians who feel that their work has been mischaracterized as a challenge to Bayesianism.

5 For a brief account from the perspective of a philosopher sympathetic with the Bayesian approach, see [Howson, 1997]; and for a brief history from someone with opposing sympathies, see [Mayo, manuscript].
2 COMPETING ACCOUNTS OF THE NATURE OF INDUCTIVE INFERENCE
Any theory of inductive inference depends upon one or more principles or presumptions that distinguish the right inductive inference relations. These principles must be there, whether they are made explicit, or, as is the more usual case, left tacit. The most fundamental of the challenges to Bayesian confirmation theory come from differing views of these principles.
2.1 Alternatives
Given the large number of approaches to inductive inference, one might imagine that there is a plethora of competing principles. A recent survey [Norton, 2005], however, shows that most accounts of inductive inference can be grouped into one of three families, there called "inductive generalization," "hypothetical induction" and "probabilistic induction." Each family is based upon a distinct inductive principle and the different members emerge from efforts to remedy the deficiencies of the principle. While it is impossible to give detailed descriptions of each family, it is useful here to indicate the breadth and longevity of the first two, which compete with the third probabilistic family. Accounts of induction belonging to inductive generalization depend on the principle that the instance of a hypothesis confirms the generalization. This ancient notion was implemented in syllogistic logic as enumerative induction: an A that is B confirms that all As are B. The principal weakness of this archetype of the family is that it may allow rather little to be inferred. We cannot use it to pass from the evidence of a 3°K cosmic background radiation to the theory of big bang cosmology. The embellishments of the archetype are devoted to extending the reach of enumerative induction. Hempel's satisfaction criterion reformulates the basic notion within the richer, first order predicate logic. Mill's methods extend it by licensing the inference to a new notion, cause. In the Method of Agreement, if A is seen to follow a, then we infer this pattern will persist generally and label a as the cause of A. Glymour's "bootstrap" account of confirmation uses instance confirmation to enable evidence to confirm hypotheses that employ theoretical notions that are not explicitly mentioned in the evidence. For our purposes, what matters is that all these approaches to induction depend upon the basic principle that an instance confirms the generalization. Accounts of induction belonging to hypothetical induction depend on the principle that the ability of an hypothesis deductively to entail the evidence is a mark of its truth. The archetype of accounts in this family is the saving of the phenomena in astronomy. That some model of planetary motions correctly fits and predicts the observed positions of the planets is a mark of its truth. The principal weakness of this principle is that it assigns the mark too indiscriminately. There are many planetary systems — geocentric, heliocentric and more — that can save the phenomena of astronomical appearances. They are all equally assigned the mark of
truth, even though there must be some sense in which one saves the appearances better than another. Embellishments of this basic notion are intended to rein in the reach of hypothetical induction by adding further conditions that must be met to earn the mark. The hypothesis that saves the phenomena must be simple, or simpler than its competitors; or we must in addition be assured of an exclusionary clause: that were the hypothesis that saves the phenomena false, then it would be unlikely for the phenomena to obtain.6 Other versions require hypotheses not just deductively to entail the evidence, but also to explain it. Ptolemy’s model of planetary motion entails that Venus and Mercury will always appear close to the sun. Copernicus’ model explains why: these planets orbit the sun. Finally, another sort of embellishment enjoins us to take into account the historical process through which the hypothesis was generated. It must be generated by a method known to be reliable. Generating hypotheses ad hoc is not judged reliable. Parapsychologists hypothesize retrospectively that their experiments failed in the presence of skeptics because of the unintended obstructive influence emanating from the skeptics. We are free to discount this as an ad hoc hypothesis.
2.2 Bayesian Annexation
All these accounts represent challenges to Bayesian confirmation theory in that they are grounded in and can be developed in a framework that does not need the familiar structures of Bayesian confirmation theory. The natural Bayesian rejoinder is that the two basic principles and their embellishments can be absorbed and vindicated by the Bayesian system. Whether one finds this possibility compelling depends on whether one finds this gain worth the cost of adopting the extra structures and commitments that Bayesian confirmation theory requires (such as will be discussed in Sections 3-5 below). The absorption and vindication can sometimes be quite successful. The clearest case is the success of the Bayesian analysis of when the success of an hypothesis H entailing true evidence E should count as inductive support for H [Earman and Salmon, 1992, §2.7]. A familiar application of Bayes' theorem7

P (H|E)/P (∼ H|E) = [P (E|H)/P (E| ∼ H)] × [P (H)/P (∼ H)]

shows that the degree of support is controlled essentially by the likelihood ratio P (E|H)/P (E| ∼ H). If it is high, so that the evidence is much less likely to come

6 One important form of this is Mayo's [1996] error statistical approach to inductive inference in science. It requires evidence to provide a "severe test" of an hypothesis. The above exclusionary clause is realized in Mayo's [1996, p. 180] severity requirement: "There is a very low probability that test procedure T would yield such a passing result, if [the hypothesis] H is false." It is of special interest here since its debate with Bayesian confirmation theory replicates in philosophy of inductive logic the tensions between Neyman-Pearson hypothesis testing and Bayesian approaches in statistics. For a Bayesian rejoinder, see [Howson, 1997, §4].

7 The terms have their obvious meanings. For example, P (H|E) is the (posterior) probability of H conditioned on E; P (H) is the (prior) probability of H against the tacit background. See Section 5 below.
about if H is false than if it is true, then the posterior probability P (H|E) increases correspondingly with respect to the prior probability P (H). If the likelihood ratio is close to one so that E is pretty much as likely to come about whether or not H is true, then P (H|E) enjoys no such increase. The analysis is appealing in that it vindicates the basic principle of hypothetical induction, as well as going beyond it in indicating when it will not work. Other efforts at absorption have been less successful. In Section 5 below, we shall see the difficulties facing Bayesian efforts to explicate notions like predictive power and simplicity. The problem is that the simple structures of the Bayesian theory do not have enough resources to distinguish the case of an hypothesis H merely entailing evidence E from it entailing it in a virtuous way that merits confirmatory rewards. There also seems to be some troubling elasticity in the way that a Bayesian vindication of independent inductive norms is achieved. Sometimes it seems that almost any norm could be justified. The problem is illustrated by difficulties faced in a simple example due to Carnap, as summarized by Earman and Salmon [1992, pp. 85-89]. The traditional form of inductive inference called "example" is a weakening of enumerative induction that merely infers from instances to instances: these cases of x are X; so that new case of x will also be X. The illustration deals with finitely many individuals a, b, c, . . . that may carry a property F (written "Fa" etc.) or fail to carry it (written "∼Fa" etc.). The individual possibilities are described by state descriptions, such as Fa & Fb & ∼Fc & . . .. Carnap initially assigned equal probability to each state description. The immediate outcome was that the argument form "example" failed. Learning that Fa and Fb are true leaves the probabilities of Fc and ∼Fc unaffected. That could be offered as a proof of the failure of "example," if one was inclined against the argument form. Carnap, however, was not so inclined. He chose to readjust the probabilities on the state descriptions in a way that favored correlations between the properties, so that "example" ends up being vindicated. Closer reflection on Carnap's illustration shows that the vindication (or otherwise) of the inductive argument form "example" does not depend on anything inherent in the probability calculus. Rather it depends upon additional assumptions we make. In this case it is our determination of whether the individuals under consideration are correlated in the loose sense that one individual carrying F tends to go with another also doing so. If we decide they are so correlated, then "example" is vindicated, whether we express the vindication formally within a probabilistic analysis or merely as a restatement of the informal notion of correlation. This suggests that the Bayesian vindication of inductive norms is less a matter of extracting them from the probability calculus and more one of our introducing them as independent assumptions that can be expressed in probabilistic language.8

8 Other examples are easy to find. The Bayesian analysis of hypothetical induction above depended essentially on our specifying as an independent assumption that the ratio P (E|H)/P (E| ∼ H) is large. That amounts to the external assumption that E would much more likely come about if H is the case than if it is not, which already pretty much gives us the principle wanted.
That makes them far less impressive than they initially seemed. One virtue does persist: since we are likely to want to combine many possibly competing inductive principles, if we do it within the framework of the probability calculus, we have some assurance that they will be combined consistently.
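Carnap's illustration is easy to verify directly. The following small sketch (my reconstruction, not Carnap's own calculation) enumerates the eight state descriptions for three individuals and confirms that, with equal probabilities on state descriptions, learning Fa and Fb leaves the probability of Fc untouched:

    from itertools import product
    from fractions import Fraction

    # State descriptions for three individuals a, b, c: each tuple entry
    # records whether the individual carries property F. Equal probability
    # is assigned to each of the 8 state descriptions, as in Carnap's
    # initial assignment.
    states = list(product([True, False], repeat=3))   # (Fa, Fb, Fc)
    weight = Fraction(1, len(states))

    def p(event):
        return sum(weight for s in states if event(s))

    p_fc = p(lambda s: s[2])
    p_fc_given_fa_fb = p(lambda s: s[0] and s[1] and s[2]) / p(lambda s: s[0] and s[1])
    print(p_fc, p_fc_given_fa_fb)   # 1/2 1/2: the instances are evidentially inert

The argument form "example" fails here exactly as described: the posterior for Fc equals its prior, and only a readjustment of the weights on the state descriptions restores the inference.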
2.3 Bayesian Foundational Principles
The strongest response to these challenges from other principled accounts would be for Bayesian confirmation theory to display its own principled foundation, justify it and argue for its superiority. However prospects here are not strong. Bayesian confirmation theory is distinctive among accounts of confirmation theory in that a great deal of effort has been expended in seeking to justify it. These efforts are usually associated with different interpretations of probability and vary markedly in character according to the interpretation. While the literature on this problem is enormous, there seems to be no vindication that is widely agreed to be successful. Rather the agreement lies only with the conclusion that has to be reached — that the probability calculus is the logic of induction. Each Bayesian seems to find his or her own path to this conclusion, often from widely diverging starting points and with critical dismissal of other starting points. A few examples will illustrate the divergences in the starting points and their difficulties. The most promising approach is associated with a relative frequency interpretation of probability, such as sought in [Salmon, 1966, pp. 83-96]. In it, the probability of a scientific hypothesis would simply be the relative frequency of true hypotheses in a reference class of hypotheses of the appropriate type. Since relative frequencies behave mostly like probabilities, it is easy to conclude that the probability calculus is the natural logic of induction, which is largely reduced to the simple task of keeping count of truth in relevant reference classes. The proposal fails because of the insurmountable difficulty of defining which are the appropriate reference classes, let alone assuring that the appropriate relative frequencies are unique and well defined in the resulting infinite sets. Another popular approach seeks to identify necessary conditions that any inductive logic must meet. A version has recently been developed by Jaynes [2003] following Cox [1961]. The attendant interpretation of probability is objective, in the sense that, in any one circumstance, there is one correct probability assignment, even though we may not know it; and is "logical" in the sense that the probabilities are treated as degrees of a logical relationship of support. The approach is strong in that, if there is a unique logic of induction to be found, circumscribing it by an ever-narrowing fence of necessary conditions will isolate it. Its difficulty is that it needs to convince the reader of the necessity of a list of conditions: that belief comes in numerical degrees that are always comparable; that the degree of belief assigned some proposition must be a function of those assigned to its disjunctive parts; and further, more specialized suppositions. We
shall dissect these sorts of conditions in Sections 3, 4 and 5 below and find that they seem necessary only if one believes from the start that degrees of belief are probabilities.9 Finally, in association with a subjective approach to probability, there are arguments that arrive at the probability calculus by urging that any distribution of belief not in accord with the probability calculus exposes us to the danger of a combination of wagers that assure a loss. These are the Dutch book arguments, best known from the work of de Finetti [1937]. In a related approach, such as Savage [1972], natural conditions on our actions are translated into conditions on beliefs. The first and most evident weakness of these proposals is that they require a quite substantial body of presumptions on our actions and preferences and the rules that should be used to relate them to our beliefs. The arguments can be defeated by denying a major presumption. I may simply be unwilling to bet; or I may never find a bet such that I am indifferent to which side I take; or I may harbor intransitive preferences.10 However there is a deeper problem. These approaches fundamentally change the question asked. It was: when is this observation good evidence for that hypothesis? The presumption of the question is that evidential support obtains independently of our beliefs and actions. These arguments then change the question in two steps that introduce dependencies on our beliefs and actions. First, the question of evidential support (Does the fossil record support evolutionary theory?) is replaced by the question of distributions of belief (How do we distribute our belief over evolutionary theory and its competitors?). At this stage, prior to consideration of

9 For example, the belief assigned to A&B is assumed to be a function of the belief assigned to A and to B given A. It is natural to assume this if one imagines all along that "belief" is just a code word for probabilities or frequencies. However, if one does not prejudge what "belief" must be, the assumption of this specific functional dependency is alien and arbitrary. For a natural counterexample that violates the rule, see [Norton, 2007, p. 144, fn. 6]. Similar difficulties arise for attempts to use "loss functions" or "scoring rules" as a means of vindicating the Bayesian system, such as in [Joyce, 1998]. They depend on using a scoring rule that picks out the subspace of additive measures as local optima in the larger space of all belief functions. Any such scoring rule must be selected carefully so as to enforce a coupling between belief in an outcome A and its negation ∼ A; for an arbitrary change in the belief assigned to A must force a suitable change in the belief in ∼ A to ensure that optimal scoring occurs only in the subspace of additive measures. That coupling is just the sort of dependency noted in Section 4.1 and in (A′) that is tantamount to choosing additivity. However natural they may seem, the specific conditions chosen to endow scoring rules with this particular property amount to a prior decision to interpret low belief values as disbelief rather than ignorance or some mix of disbelief and ignorance. In the latter case, belief in an outcome is no longer closely coupled to the belief in its negation.

10 Hájek [2008] argues that four of the principal means of vindicating Bayesianism fail, each in the same way. The means are the Dutch book arguments, representation theorem arguments, calibration arguments and gradational accuracy arguments.
Each depends on displaying a theorem that is interpreted to show that rationality forces Bayesianism. Each vindication, Hájek urges, overlooks a "mirror-image" theorem that shows the reverse, that rationality requires degrees of belief not to be probabilities. For example, the Dutch book theorem assures the existence of a combination of fair bets that forces a sure loss for you if your numerical beliefs are not probabilities. Its mirror-image is the "Czech book theorem." It assures that a benevolent Czech bookie can find a combination of fair bets that assures a certain gain for you just if your numerical degrees of belief are not probabilities.
actions, we are to accept that there is no one right distribution of belief. Yours is presumed as good as mine. Second, rules are provided for translating beliefs into actions, so that the consequences of our actions can be assessed and used to restrict how we distribute our beliefs, while in general never eliminating the possibility of different belief distributions. These rules make most sense if we are at a racetrack or engaged in an engineering project where beliefs must be translated into design decisions. They are strained if our concern is the question originally asked: whether, for example, the observed motion of the galaxies provides good evidence for the hypothesized heat death of the universe, billions of years in the future — a circumstance independent of our beliefs and actions. The two-step process has produced an answer to a question, but it is not the question with which we started.11
2.4 Is There a Unique Logic of Inductive Inference?
So far, all the approaches considered — Bayesian and otherwise — have presumed that there is a unique logic of induction to be found. Another type of challenge to Bayesian confirmation theory denies this. Rather these challenges portray the conditions surrounding individual problems as determining which are the appropriate inductive strategies and, from this perspective, it turns out that there is no single logic of induction to be identified. Rather than having principled grounds for identifying the right inductive logic, these approaches offer principled grounds for denying that such a thing is possible. A well articulated version of this approach is the learning theoretic paradigm, as developed in [Kelly, 1996]. It presumes that our goal is getting to the truth through some reliable method of discovery. At its simplest, an agent receives a potentially unlimited stream of data and must formulate hypotheses expressing the pertinent truths of the world, as the data accumulates. The central problem of the literature is to discern the sorts of strategies that will lead to convergence to the truth in the hypotheses formulated successively, with more rapid convergence prized. In certain simple worlds, these methods can be governed by familiar inductive notions. A data stream "day, night, day, night, day, . . . " rapidly succumbs to the hypothesis that day follows night and night follows day. In this case, enumerative induction is all that an agent needs to formulate the true hypothesis quite early. Matters get more complicated as the problems get harder. Looking for regularities in what follows what leads to limited statistical success if the problem is weather prediction at one location and the data stream is "rain, shine, shine, shine, rain, . . . "; or stock market prediction with a data stream "up 3 points, up 2 points, down 20 points, up 3 points, . . . " There is a small amount of inertia in both the weather and the stock market, so that conditions one day tend somewhat to be

11 For a spirited assault on Dutch book arguments, see [Bacchus et al., 1990]. They urge these arguments fail since an agent should simply refuse to accept a combination of bets that assures a loss; it is (pp. 504-505) "a matter of deductive logic and not of propriety of belief." Schick's [1986] view is similar.
replicated the next. These methods fail completely in inhospitable worlds. If the data stream is derived from a well-designed roulette wheel, "red, black, black, red, red, . . . ," no hypotheses on when red will follow black (other than chance) can succeed. And matters can be worse in a demonic universe, which delivers data maliciously tailored to deceive the particular agent in question. In this reliabilist paradigm, inductive norms are parasitic on convergence to the truth. It is simply a matter of ingenuity to conceive universes in which a given inductive norm will be fruitful or harmful. For example, a standard supposition of scientific methodology is that we should only advance hypotheses that are consistent with the data at hand. However it turns out that, once we allow for the computational limitations of agents, a demand for consistency can conflict with reliability. So it may be efficient for us to offer an hypothesis inconsistent with the evidence, even though most accounts of induction, including the Bayesian, insist the hypothesis has been refuted. For a critique of Bayesian confirmation theory from the reliabilist perspective, see [Kelly and Glymour, 2004]. The idea that inductive strategies will vary according to the domain at hand is the central idea of the "material theory of induction" [Norton, 2003]. Its name derives from the idea that the license of an inductive inference does not come from the form of the sentences involved, as it does in deductive logic. Rather the license comes from the subject matter of the inference. Therefore the inductive logic applicable will vary from domain to domain as the licensing facts change; there is no universally applicable logic of induction. In one restricted area, this notion of material facts fixing the inductive logic is already familiar to subjective Bayesians through Lewis' [1980] "principal principle." For processes for which objective chances are available, the principle enjoins subjective Bayesians to match their subjective degrees of belief to the physical chances. Quantum mechanical laws deliver chances for the timing of the decay of a radioactive atom. If we match our subjective degrees of belief concerning those timings to the quantum mechanical chances, then we are assured high belief in outcomes that are most likely to happen and low belief in those unlikely to happen. A by-product is that our degrees of belief automatically conform to the probability calculus.12 The principal principle naturally generalizes to the central claim of the material theory of induction: that the factual properties of the system under consideration will determine the appropriate inductive logic. It follows that there may be systems whose facts dictate that the applicable logic of induction is not probabilistic. Norton [2007, §8.3; forthcoming] describes certain physical systems whose governing laws are indeterministic in the sense that the complete specification of

12 The facts that license probabilistic inferences need not be the sorts of physical chances recovered from stochastic theories like quantum mechanics. For example, imagine that one is a bookmaker accepting wagers on a definite matter of presently unknown fact: say, whether some recently discovered infectious illness is viral or bacterial in origin.
If the factual conditions surrounding the bookmaking conform to all those assumed in Dutch book arguments, then those facts determine that one’s degree of belief ought to conform to the probability calculus, on pain of exposure to a sure loss.
the present state of the system does not fix the future state; yet these governing laws provide no probabilities for the various possible futures. They simply tell us that, given such-and-such a present state, this or that future state is possible. In the corresponding inference problem, we are required to distribute beliefs over these possible futures, knowing the state of the present. If we distributed beliefs as probabilities rather than in some weaker form, we would end up with beliefs that outstrip the full physical specification as given by the initial conditions and the applicable physical laws. We must assert that one possible outcome is twice as probable as another; or as probable as another; or half as probable. That is, we must pretend to know more than the physical laws, which are only able to assert that each outcome is possible, without any sense of "twice as possible" or "half as possible."
3 CHALLENGES TO FRAMEWORK ASSUMPTIONS
The challenges considered so far in Section 2 above have been global in the sense that they arise from tensions between Bayesian confirmation theory as a whole and other approaches to inductive inference. If we recall that Bayesian confirmation theory is a composite of many assumptions, then another class of challenges to Bayesian confirmation theory is identifiable. These are challenges to specific assumptions of Bayesian confirmation theory. In order to structure the inventory of these challenges, the assumptions comprising Bayesian confirmation theory will be divided into three parts, called: "Framework," to be discussed in this section; "Additivity," to be discussed in Section 4; and "Bayes' dynamics," to be discussed in Section 5.13
3.1 Framework Assumptions
What is common to all versions of Bayesian confirmation theory is the assumption that there is a real-valued magnitude, P (H|E), that is one's subjective degree of belief or some objective measure of support for the hypothesis H from evidence E. This apparently simple assumption of a real-valued magnitude can be decomposed further into a series of assumptions, each of which presumes the one before:

Precision. There is a magnitude P (A|B) that represents the degree of belief or support accrued to A given B.

Universal comparability. It is always possible to compare two such degrees, finding one to be less than, equal to or greater than the other.

Partial order. The comparison relation "no less than" ≤ is a partial

13 This disassembly is similar to the one effected in [Norton, 2007], which in turn draws on an extensive literature in qualitative probability. For surveys see [Fine, 1973; Fishburn, 1986].
order: it is reflexive14, antisymmetric15 and transitive16.

Real values. The degrees are real-valued.

Different challenges would require us to halt at different stages as we pass up this hierarchy. Lest that seem odd, the notion that a cautious ascent is needed may be more palatable if we consider an analogous problem of assigning degrees of simplicity to hypotheses. The same hierarchy can be formed, but few are likely to ascend all the way to the idea of a single real value that measures the degree of simplicity of all hypotheses we encounter. We may stall at the start, judging the very idea of a degree of simplicity inadmissible. Or we may allow comparison among closely related hypotheses only. Jeffreys [1961, p. 47], for example, wished to assign greater prior probability to simpler hypotheses. So he defined the complexity m of certain differential equations as the sum of their order, degree and the absolute values of their coefficients. Unit prior probability was then distributed over the resulting complexity classes as a decreasing function of m, such as 2^−m or 6/(π²m²). Jeffreys' degree is defined only for a specific class of equations. It would not sustain universal comparability in that we could not compare the degree of simplicity of the hypothesis of a linear relationship between the velocity and distance to the galaxies with that of the hypothesis of the germ theory of disease.
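For the two decreasing functions Jeffreys mentions, the resulting priors can be written down directly. A small sketch (mine, taking m to range over 1, 2, 3, . . . for simplicity) shows that each choice does distribute unit prior probability over the complexity classes, since Σ 2^−m = 1 and Σ 6/(π²m²) = 1 for m ≥ 1:

    import math

    # Jeffreys-style priors over complexity classes m = 1, 2, 3, ...
    def prior_geometric(m):
        return 2.0 ** (-m)

    def prior_quadratic(m):
        return 6.0 / (math.pi ** 2 * m ** 2)

    # Both partial sums approach 1, so unit prior probability is spread
    # over the complexity classes, with simpler classes favored.
    print(sum(prior_geometric(m) for m in range(1, 1000)))   # ~1.0
    print(sum(prior_quadratic(m) for m in range(1, 1000)))   # ~0.999 (slower convergence)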
3.2 The Very Idea
The most fundamental objection is that there is something mistaken about the very idea of a degree of belief or support that evidence lends to a hypothesis. This objection can rest on the visceral sense that such assignments are simply spurious precision: that we can conjure up a number does not mean that we are measuring something. A colorful statement of the mismatch of Bayesian analysis with diagnostic clinical practice is Feinstein's [1977] "The haze of Bayes, the aerial palaces of decision analysis, and the computerized Ouija board." In a different direction, Glymour's [1980, Ch. III] "problem of old evidence" speaks against the idea that the fundamental quantity P (A|B) presumed in the framework is an adequate basis for an account of inductive inferences. The import of evidence E on hypothesis H is gauged by comparing the prior probability P (H|B) with the posterior P (H|E&B), where B is the totality of background knowledge. It frequently happens that the evidence E is already part of our background knowledge. The celebrated case concerns the motion of Mercury, which, at the start of the 20th century, was known to deviate slightly from the predictions of Newtonian gravitation theory. In November 1915, Einstein found that his newborn general theory of relativity "H" predicted exactly the observed deviations
14 For all admissible A, B, P (A|B) ≤ P (A|B).

15 For all admissible A, B, C and D, if P (A|B) ≤ P (C|D) and P (C|D) ≤ P (A|B), then P (A|B) = P (C|D).

16 For all admissible A, B, C, D, E and F, if P (A|B) ≤ P (C|D) and P (C|D) ≤ P (E|F), then P (A|B) ≤ P (E|F).
“E”. The universal agreement is that E provided strong support for H. However, since E is part of the background then known to Einstein, we have E&B = B so that P (H|E&B) = P (H|B) which shows that E is evidentially inert. If this objection is sustained, it means that a confirmation theory employing only the quantities P (H|E) and P (H|E&B) cannot support the judgment that E is good evidence for H.17 The obvious escape is to replace the prior P (H|B) with an adjusted prior P (H|B ′ ), where B ′ is the background B with E somehow excised. Glymour’s original presentation sought to block this escape by urging that it is quite unclear how this excision could be effected. Other, more elaborate escapes are reviewed in [Earman, 1992, Ch. 5], including the notion that we assign evidential value to learning the logical relation between some hypothesis H and the evidence E that is already, unbeknown to us, part of our background B.
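The inertness claim follows in one line, by a standard reconstruction not spelled out in the text: since E is already part of B, we have E&B = B, and so

P (H|E&B) = P (H&E&B)/P (E&B) = P (H&B)/P (B) = P (H|B).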
3.3 Universal Comparability
Once we allow that the degree P (A|B) is always defined, it may still not follow that all degrees are comparable. Keynes [1921, pp. 37-40] allows that all degrees will lie between those assigned to impossibility and certainty, but urges that not all intermediate degrees are mutually comparable. The clearest way that comparability can fail is when we conditionalize on very different sorts of backgrounds so that we are distributing belief over incommensurable possibilities. For example, Norton [2007, pp. 147-48] expands a concern expressed by Humphreys [1985] in the context of propensity interpretations to suggest that the direct degrees P (E|H) and the inverse P (H|E) may measure very different things. P (E|H) may be derived by computing a quantity in a stochastic theory: the chance of E, the decay of some particular radioactive atom in some time as deduced from H, the laws of quantum physics. The inverse P (H|E) assigns degrees to different physical laws on the basis of the evidence of radioactive decay. The first deals with possibilities licensed by a known theory and is grounded in the science; the second deals with speculation on how the world might admit different laws whose range is delimited by our imagination. Even if we can find protocols that allow us to assign numbers in each case, it is not clear that there is anything meaningful in the comparison of those numbers — just as it is meaningless to compare degree of temperature and degrees Baumé of specific gravity, even though both are real numbers and arithmetic allows the

17 Glymour's original presentation used Bayes' theorem to deduce P (H|E&B) = P (H|B), although the problem is independent of Bayes' theorem.
comparison.18 That there are different sorts of uncertainty even in decision theoretic contexts is illustrated by Ellsberg's [1961] urns. Consider two urns, each containing 100 balls. The first has some combination of red and black balls. Any ratio from 0-100% red is possible and we are completely uncertain over which. The second urn has exactly 50 red and 50 black balls. A ball will be drawn from each urn. What is our belief that it will be red? For both urns, the symmetry of the arrangements requires that we be equally certain of a red or a black, so, in drawing from both urns we should be indifferent to making bets on red or on black. Any protocol for converting these inclinations to bet into degrees of belief must respect that symmetry. Therefore, if the protocol generates probabilities solely on the basis of these inclinations, in each case it will return a probability of 0.5. However Ellsberg suggests that most of us prefer the bet on the urn of known composition, so that the degrees of belief assigned to red ought to differ in the two cases. For our purposes, this illustrates the possibility of different senses of uncertainty that may not be readily comparable if we try to characterize them in a single magnitude. These different senses reflect a distinction to be made in Section 4 below between disbelief and ignorance. Jaynes [2003, pp. 658-59] regards such quibbles over different senses of uncertainty with characteristic, entertaining disdain. He makes an analogy to a mineralogist who may classify rocks with different parameters but can still find a way to trade off changes in one parameter against the other to recover a single scale of comparison. The idea that universal comparability must be discarded is a consequence of a significant development of the Bayesian approach that has adopted several forms. What they have in common is the notion that a single number may not be rich enough to capture the extent of belief. That concern is readily motivated whenever we are pressed to assign a definite probability to some outcome — say, whether it will rain tomorrow. Assigning a probability 0.6 is too precise. Our belief is better represented by something vaguer, a probability somewhere around 0.6. In one approach, we might replace definite values by intervals of values; the probability of rain is 0.5 to 0.7. In another we might take a set of probability measures as representing our belief states — say the set of all probability measures that assign a probability of 0.5 to 0.7 to rain tomorrow. Developing these proposals into well functioning theories is not straightforward. For developments of these notions, see [Kaplan, 1998; Kyburg, 1959; Kyburg and Teng, 2001; Levi, 1974; 1980, Ch. 9; Walley, 1991]. For our purposes, what matters is that these proposals break with the framework sketched in that they deny universal comparability. Take the simplest case of interval-valued degrees of belief. The belief in some outcome represented by the

18 On the basis of their being "sufficiently disparate," Fishburn [1986, p. 339] offers the following pair as candidates for failure of comparability: "A=Mexico City's population will exceed 20,000,000 by 1994; B=The first card drawn from this old and probably incomplete bridge deck will be a heart."
Challenges to Bayesian Confirmation Theory
405
interval [0.8, 0.9] is greater (in the sense of being closer to certainty) than the interval [0.1, 0.2]. But the intervals [0.3, 0.7] and [0.4, 0.6] are none of greater than, less than or equal to each other. Analogous effects arise for richer structures that dispense with single valued degrees of belief.
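The failure of comparability for interval-valued degrees is easy to make concrete. The following minimal sketch (illustrative code, not from the text; the comparison rule is one natural choice among several) implements an ordering on belief intervals and shows that it is only partial:

```python
# Sketch: interval-valued degrees of belief form only a partial order.
# Here [a, b] counts as greater than [c, d] when its entire range lies
# closer to certainty, i.e. when a > d.

def compare(i, j):
    """Return '>', '<' or 'incomparable' for belief intervals i and j."""
    (a, b), (c, d) = i, j
    if a > d:
        return ">"
    if c > b:
        return "<"
    return "incomparable"

print(compare((0.8, 0.9), (0.1, 0.2)))   # '>': closer to certainty throughout
print(compare((0.3, 0.7), (0.4, 0.6)))   # 'incomparable': overlapping intervals
```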
3.4 Transitivity
If we allow that the degrees are always comparable, it does not yet follow that they form a partial order, which is reflexive, antisymmetric and transitive. Of these three properties, transitivity is of the greatest concern. Norton [2007, §3.2.2] has suggested that transitivity may fail in certain complicated cases if common wisdoms on inductive support obtain. That is, if we have a case of evidence entailed by several different hypotheses, it is routine to say that the evidence supports one of these hypotheses more than another if the first hypothesis displays more of some relevant virtue, such as greater simplicity, greater explanatory power or greater fecundity. Using three hypotheses and three virtues, it is easy to envisage structures in which the first hypothesis is better confirmed than the second, the second better than the third and the third better than the first. This violates transitivity.19

19. For the example of a slightly bent coin, Fishburn [1986, p. 339] describes three propositions for which intransitivity "do[es] not seem unreasonable": "A=The next 101 flips will give at least 40 heads; B=The next 100 flips will give at least 40 heads; C=The next 1000 flips will give at least 460 heads." The suggestion is that we could have the same belief in A and in C, and the same belief in C and in B, but that we must have more belief in A than B.
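One hypothetical way to see how three hypotheses and three virtues can generate a cycle: score the hypotheses on each virtue and say one hypothesis beats another when it scores higher on a majority of virtues. The scores below are invented purely for illustration; the structure is that of the Condorcet voting paradox:

```python
# Sketch: three hypotheses scored (hypothetically) on three evidential
# virtues. H beats H' when it scores higher on a majority of virtues.
# The resulting comparison cycles, violating transitivity.

scores = {                    # (simplicity, explanatory power, fecundity)
    "H1": (3, 1, 2),
    "H2": (2, 3, 1),
    "H3": (1, 2, 3),
}

def better(h, k):
    wins = sum(a > b for a, b in zip(scores[h], scores[k]))
    return wins >= 2          # majority of the three virtues

print(better("H1", "H2"), better("H2", "H3"), better("H3", "H1"))
# True True True: H1 > H2 > H3 > H1, a cycle.
```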
3.5 Real Values
Once it is allowed that the degrees may be partially ordered, it is still not assured that they must be real valued, or isomorphic to the reals or some interval of real numbers. A great deal of effort has been expended in determining what additional assumptions are needed to assure that the degrees are real valued. These additional assumptions have been given many forms and are generally known as "Archimedean axioms" [Fishburn, 1986, pp. 341-42]. Their function is to block the possibility of value sets that extend beyond real values in admitting infinitesimally small or infinitely large values. The familiar illustration in this literature of how partially ordered degrees may fail to be real valued is generated by degrees that are ordered pairs ⟨x, y⟩ of reals x and y [Jeffreys, 1961, pp. 19-20]. They are ordered by the rule

⟨X, Y⟩ > ⟨x, y⟩ when X > x, no matter the values of Y, y; or, if X = x, when Y > y.
There is no way to map these degrees ⟨x, y⟩ onto the reals so that the order is preserved. We could imagine a two-parameter family of hypotheses Hx,y such that our degrees of belief in the various hypotheses may be ordered according to the rule
just sketched. One might doubt whether such a family could arise in realistic problems and take that doubt as a reason to discount the challenge. Or one might be concerned that adopting a Bayesian confirmation theory presumes in advance that such a distribution of belief is inadmissible. Such a presumption is not compatible with the universal applicability supposed for the theory.
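Jeffreys' rule is the lexicographic ordering on pairs, and it is easily implemented; that it cannot be embedded in the reals is a classical fact (for every real x there is a disjoint "gap" between ⟨x, 0⟩ and ⟨x, 1⟩, and the reals cannot hold uncountably many disjoint gaps), which code can illustrate but not prove. A minimal sketch:

```python
# Sketch: degrees of belief as pairs <x, y>, ordered as in the text:
# <X, Y> > <x, y> when X > x (whatever Y, y), or when X = x and Y > y.

def greater(p, q):
    (X, Y), (x, y) = p, q
    return X > x or (X == x and Y > y)

print(greater((0.5, 0.1), (0.4, 0.9)))   # True: first coordinate dominates
print(greater((0.5, 0.2), (0.5, 0.1)))   # True: ties broken by second coordinate

# For every real x, <x, 0> < <x, 1>: uncountably many disjoint "gaps",
# so no order-preserving map into the reals can keep them all apart.
```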
4 ADDITIVITY

4.1 The Property
Probability measures are additive measures. That means that they conform to the condition:

If A and B are mutually exclusive then, for any C,
P(A ∨ B|C) = P(A|C) + P(B|C).     (A)
Additivity is the most distinctive feature of the probability calculus. In the context of confirmation theory, it amounts to assigning a particular character to the degrees of belief of the theory. The degrees of this additive measure will span from a maximum value for an assured outcome (conventionally chosen as one) to a minimum value for an impossible outcome (which must be zero20). If the high values are interpreted as belief, with unit probability as certainty, then it follows that the low values must represent disbelief, with zero probability as complete disbelief. To see this, recall that near certain or certain belief in A corresponds to near complete or complete disbelief in its negation, ∼A.21 If we assign high probability 0.99 or unit probability to some outcome A, then by additivity the probability assigned to the negation ∼A is 0.01 or zero, so that these low or zero values correspond to near complete disbelief or complete disbelief.

This interpretation of low probability as disbelief is already expressed in the functional dependency between the probabilities of outcomes and those of their negations. We have P(∼A|C) = 1 − P(A|C), which entails

P(∼A|C) is a strictly decreasing function of P(A|C)     (A′)

More generally, taking relative negations, P(∼A&B|C) = P(B|C) − P(A&B|C), we recover the more general functional dependency

P(∼A&B|C) is a strictly decreasing function of P(A&B|C)     (A″)

and a strictly increasing function of P(B|C). These dependencies tell us that a high probability assigned to an outcome corresponds to a low probability assigned to its negation or relative negation, which is the characteristic property of a scale of degrees that spans from belief to disbelief.

20. If Imp is an impossible outcome, we have Imp = Imp ∨ Imp. Since Imp and Imp are mutually exclusive in the sense that Imp & Imp = Imp, additivity applies so that P(Imp) = P(Imp) + P(Imp), from which we have P(Imp) = 0.

21. I take it as a definition that disbelief in A is the same thing as belief in not-A.
While additivity (A) entails the weaker functional dependencies (A′) and (A″), the gap between them and (A) is not great. If the functional dependencies (A′) and (A″) are embedded in a natural context, it is a standard result in the literature that the resulting degrees can be rescaled to an additive measure satisfying (A). (See [Aczel, 1966, pp. 319-24; Norton, 2007, §7].)
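The condition (A) and the dependency (A′) are easy to check numerically. A minimal sketch on a toy outcome space, with an invented additive measure over four atoms:

```python
# Sketch: verify additivity (A) and the dependency (A') on a toy space.
# Propositions are sets of atoms; P is generated by an (illustrative)
# additive measure over the atoms 1..4.

atom = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def P(A):                                      # probability of a set of atoms
    return sum(atom[w] for w in A)

A, B = {1, 2}, {3}                             # mutually exclusive propositions
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12   # additivity (A)

universe = set(atom)
for A in [{1}, {1, 2}, {1, 2, 3}]:
    print(P(A), P(universe - A))               # P(~A) = 1 - P(A): as P(A)
                                               # rises, P(~A) falls, per (A')
```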
4.2 Non-Additive Measures: Disbelief versus Ignorance
Once it is recognized that the additivity (A) of the probability calculus amounts to selecting a particular interpretation of the degrees, then the ensuing challenge becomes inevitable. In many epistemic situations, we may want low degrees to represent ignorance or some mix of ignorance and disbelief, where ignorance amounts to a failure to commit to belief or, more simply, an absence of belief or disbelief. Shafer [1976, pp. 22-25] considers how we might assign degrees of belief to the proposition that there are living beings in orbit around Sirius, an issue about which we should suppose we know nothing. No additive measure is appropriate. If we assign low probability to "life," additivity then requires us to assign high probability to "no-life." That asserts high certainty in there being no life, something we do not know. The natural intermediate of probability 1/2 for each of the two outcomes "life" and "no-life" fails to be a usable ignorance value, since it cannot be used if we are ignorant over more than two mutually exclusive outcomes.

This example makes clear that introducing an element of ignorance requires a relaxation of the functional dependencies (A′) and (A″). It must be possible to assign a low degree of belief to some outcome without being thereby forced to assign a high degree to its negation. Measures that allow this sort of violation of additivity are "superadditive": if A and B are mutually exclusive then, for any C,

P(A ∨ B|C) ≥ P(A|C) + P(B|C).

The extent to which the equality is replaced by an inequality is the extent to which the measure allows representation of ignorance.

The best-known superadditive calculus is the Shafer-Dempster calculus. In it, an additive measure m, called a "basic probability assignment," is defined over the power set of the "frame of discernment" Θ, so that Σ m(A) = 1, where the summation extends over all subsets A of Θ, and m({}) = 0. This basic probability assignment is used to generate the quantity of interest, the "belief function" Bel, which is defined as

Bel(A) = Σ m(B)

for any A ⊆ Θ, where the summation is taken over all subsets B of A. These belief functions allow blending of disbelief and ignorance. For example, that we are largely ignorant over the truth of "life" can be represented by

m(life) = 0.1     m(∼life) = 0.1     m(life ∨ ∼life) = 0.8

which induces the belief function

Bel(life) = 0.1     Bel(∼life) = 0.1     Bel(life ∨ ∼life) = 1
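The induced belief function is computed by summing the basic probability assignment over subsets. A minimal sketch of the life/no-life example (frame and mass values as in the text):

```python
# Sketch: Shafer-Dempster belief function for the "life in orbit around
# Sirius" example. m is the basic probability assignment over subsets of
# the frame of discernment; Bel(A) sums m(B) over all subsets B of A.

frame = frozenset({"life", "no-life"})
m = {
    frozenset({"life"}): 0.1,
    frozenset({"no-life"}): 0.1,
    frame: 0.8,                    # mass left on the whole frame: ignorance
}

def bel(A):
    return sum(mass for B, mass in m.items() if B <= A)

print(bel(frozenset({"life"})))      # 0.1: little belief in life
print(bel(frozenset({"no-life"})))   # 0.1: and equally little in no-life
print(bel(frame))                    # 1.0: certainty in the tautology
# Superadditivity: Bel(life) + Bel(no-life) = 0.2 < 1 = Bel(life v no-life)
```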
4.3 Complete Ignorance: The First Problem of the Priors
In Bayesian confirmation theory, prior probability distributions are adjusted by Bayes' theorem to posterior probability distributions that incorporate new evidence learned. As we trace this chain back, the prior probability distributions represent states of greater ignorance. Bayesian confirmation theory can only give a complete account of this learning process if it admits an initial state representing complete ignorance. However, representing ignorance is a long-standing difficulty for Bayesian confirmation theory and the case of complete ignorance has been especially recalcitrant. It is designated here as the first problem of the priors, to distinguish it from another problem with prior probabilities delineated in Section 5.6 below.

There are two instruments already in the probability literature that are able to delimit the representation of this extreme case, the epistemic state of complete ignorance. Norton [2008] has given an extended analysis of how the two may be used to do this. The first instrument is the principle of indifference. It asserts that if we have no grounds for preferring one outcome to a second, then we should assign equal belief to both. This platitude of evidence is routinely used to ground the classical interpretation of probability and famously runs into trouble when we redescribe the outcome space. Complete ignorance in one description is equivalent to complete ignorance in another. That fact allows one to infer quite rapidly that the degree of belief assigned to some compound proposition A ∨ B should be the same as the degree of belief assigned to each of the disjunctive parts A and B, even though A and B may be mutually exclusive.22 Since no probability distribution can have this property, it is generally concluded that there is something wrong with the principle of indifference.

The difficulty is that the principle of indifference is not so easily discarded. It is a platitude of evidence. If beliefs are grounded in reasons and we have no reasons to distinguish two outcomes, then we should have the same belief in each. The alternative is to retain the principle of indifference and discard the notion that a probability distribution can adequately represent complete ignorance. Instead we are led to a representation of complete ignorance by a non-probabilistic distribution with three values: Max and Min for the extreme values of certainty and complete disbelief, and Ig ("ignorance") for everything in between:

Degree(A) = Max, for A an assuredly true outcome
          = Min, for A an assuredly false outcome     (I)
          = Ig, for A any contingent23 outcome

22. An example is von Mises' famous wine-water problem. We have a glass with a mixture of wine and water, knowing only that the ratio of wine to water lies in 1/2 to 2. So we are indifferent to each of the intervals of wine to water: (a) 1/2 to 1, (b) 1 to 1 1/2, (c) 1 1/2 to 2; and assign equal probability of 1/3 to each. However, if we redescribe the problem in terms of the ratio of water to wine, we end up assigning equal probability of 1/3 to the intervals of water to wine: (a′) 1/2 to 1, (b′) 1 to 1 1/2, (c′) 1 1/2 to 2. Now the interval (a) describes the same outcome as the disjunction (b′) ∨ (c′). So we assign probability 1/3 to the disjunction (b′) ∨ (c′) and also to each of its parts.
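The redescription problem of footnote 22 can be checked numerically: a distribution uniform over the wine-to-water ratio assigns a different probability to the very same physical outcome than a distribution uniform over the water-to-wine ratio. A minimal Monte Carlo sketch:

```python
# Sketch: von Mises' wine-water problem. Uniformity over the wine/water
# ratio r in [1/2, 2] gives the outcome "wine/water in [1/2, 1]"
# probability 1/3. Described as "water/wine in [1, 2]", uniformity over
# the water/wine ratio gives the same outcome probability 2/3.

import random

N = 100_000
r = [random.uniform(0.5, 2.0) for _ in range(N)]    # uniform over wine/water
print(sum(0.5 <= v <= 1.0 for v in r) / N)          # ~1/3

w = [random.uniform(0.5, 2.0) for _ in range(N)]    # uniform over water/wine
print(sum(1.0 <= v <= 2.0 for v in w) / N)          # ~2/3: same outcome,
                                                    # different probability
```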
The essential property here is that we can assign the ignorance degree Ig to some contingent outcome A ∨ B and that same ignorance degree to each of its mutually exclusive, disjunctive parts, A and B. This is the only distribution for which this is true over all contingent outcomes.

The second instrument used to delineate the epistemic state of complete ignorance is the notion of invariance, used so effectively by objective Bayesians, but here used in a way that objective Bayesians may not endorse. The notion ultimately produces serious problems for Bayesian confirmation theory. The greater our ignorance, the more symmetries we have under which the epistemic state should be invariant. It is quite easy to accumulate so many symmetries that the epistemic state cannot be a probability distribution. For example, let us say we know only

x is a real number in the interval (0,1).     (DATUM)
This information (DATUM) remains true if x is replaced by x′ = 1 − x, where the function x′(x) is self-inverting. It also remains true if x is replaced by x″ = 1 − (1 − (1 − x)²)^1/2, where once again the function x″(x) is self-inverting. So our epistemic state must remain unchanged under each of these transformations. It is easy to show that no probability distribution can be invariant under both. (See [Norton, 2008, §3.2].)

Invariance requirements can be used to pick out the unique state of complete ignorance, which turns out to be (I) above. To see the relevant invariance, consider some proposition A over whose truth we may be completely ignorant. Our belief would be unchanged were A replaced by ∼A.24 That is, the state of complete ignorance remains unchanged under a transformation that replaces every contingent proposition with its negation. It can readily be seen that the ignorance distribution (I) satisfies this invariance requirement. Each contingent proposition and its negation are assigned the same degree Ig.25 That this ignorance distribution (I) is the only distribution satisfying this invariance that we are likely to encounter is made more precise by the demonstration that it is the only monotonic26 distribution of belief with the requisite invariance [Norton, 2008, §6].

23. Contingent propositions are defined here as propositions that may be either true or false.

24. Describing this transformation more figuratively makes the invariance intuitive. Imagine that the content of proposition A has been written in a normal English sentence by a scribe on a slip of paper, folded before us on the table. We form our belief over the truth of the sentence on the paper: it is complete ignorance because we have no idea what the sentence says. We are now told that the scribe erred and mistakenly wrote the content of ∼A on the paper instead. That new information would not change our belief state at all.

25. It is assumed that the ignorance does not extend to logical truths, so that we know which propositions are assuredly true and false and we assign Max and Min to them. We could define a broader ignorance state in which logical truths are presumed unknown as well by assigning the same value Ig to all propositions.

26. A distribution of belief degree(.) is monotonic if, whenever A logically entails B, degree(A) ≤ degree(B).
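The incompatibility of the two invariances can be exhibited by iteration. The sketch below follows the style of the argument (it is an illustration, not the text of [Norton, 2008, §3.2]): each self-inverting map swaps the interval below its fixed point with the interval above it, so invariance forces probability 1/2 onto both; alternating the maps then forces probability 1/2 onto intervals shrinking toward the empty set:

```python
# Sketch: why no probability distribution on (0,1) is invariant under
# both x' = 1 - x and x'' = 1 - (1 - (1-x)^2)^(1/2). Each map is
# self-inverting and swaps (0, c) with (c, 1), c its fixed point, so
# invariance forces P((0, c)) = 1/2. Alternating the maps then pins
# probability 1/2 on ever smaller intervals (0, eps), with eps -> 0.

def f2(x):                        # x'' = 1 - (1 - (1-x)^2)^0.5
    return 1 - (1 - (1 - x) ** 2) ** 0.5

eps = 1 - 2 ** -0.5               # fixed point of f2: P((0, eps)) = 1/2
for step in range(5):
    # x' = 1-x sends (0, eps) to (1-eps, 1); f2 sends (1-eps, 1) back
    # to (0, f2(1-eps)). Invariance keeps the probability at 1/2.
    eps = f2(1 - eps)
    print(f"P((0, {eps:.2e})) is forced to be 1/2")
# The interval shrinks to nothing while its probability stays 1/2:
# no probability distribution can satisfy both invariances.
```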
The negation map — the transformation that replaces each contingent proposition with its negation — is a little more complicated than it may initially seem. To see the difficulty, imagine that the outcome space is exhausted by n mutually exclusive atomic propositions, A1, A2, . . . , An. The transformation replaces atomic propositions, such as A1, by compound propositions, such as A2 ∨ A3 ∨ . . . ∨ An, so it may not be evident that the transformation is a symmetry of the outcome space. It is, in the sense that it maps the outcome space back to itself and is self-inverting; A2 ∨ A3 ∨ . . . ∨ An is mapped to A1. However, it does not preserve additive measures. The transformation takes an additive measure m to what Norton [2007a] describes as a "dual additive measure" M. These dual additive measures have properties that are, on first acquaintance, odd looking. They are additive, but their additivity is attached to conjunctions. If we have propositions A and B such that A ∨ B is always true, then we can add their measures as M(A&B) = M(A) + M(B). The notion of a dual measure allows a simple characterization of the ignorance distribution (I): it is the unique, monotonic measure that is self-dual.

If one does not see that an epistemic state of complete ignorance is represented by the non-probabilistic (I), one is susceptible to the "inductive disjunctive fallacy" [Norton, forthcoming a]. Let a1, a2, a3, . . . be a large number of mutually exclusive outcomes over which we are in complete ignorance. According to (I), we remain in that state of complete ignorance for any contingent disjunction of these outcomes, a1 ∨ a2 ∨ a3 ∨ . . . If one applies probabilities thoughtlessly, one might try to represent the state of complete ignorance by a broadly spread probability distribution over the outcomes. Then the probability of the disjunction can be brought close to unity merely by adding more outcomes. Hence one would infer fallaciously to near certainty for a sufficiently large contingent disjunction of outcomes over which we are individually in complete ignorance. The fallacy is surprisingly widespread. A striking example is supplied by van Inwagen [1996] in answer to the cosmic question "Why is there anything at all?" There is, he asserts, one way for no thing to be, but infinitely many ways for different things to be. Distributing probabilities over these outcomes fairly uniformly, we infer that the disjunction representing the infinitely many ways things can be must attract all the probability mass, so that we assign probability one to it.
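On a small finite outcome space, the negation map and its dual measure can be exhibited directly. A minimal sketch with an invented three-atom measure: the transform of an additive measure m is the measure M with M(A) = m(∼A), and M adds over conjunctions of propositions whose disjunction is certain:

```python
# Sketch: the negation map on a finite outcome space and the "dual
# additive measure" it produces. Propositions are subsets of the atoms;
# the negation map sends A to its complement ~A, carrying the additive
# measure m to the dual measure M(A) = m(~A).

atoms = frozenset({1, 2, 3})
m_atom = {1: 0.2, 2: 0.3, 3: 0.5}          # illustrative additive measure

def m(A):
    return sum(m_atom[w] for w in A)

def M(A):                                   # dual measure under negation map
    return m(atoms - A)

# Dual additivity attaches to conjunctions: if A v B is certain
# (A | B == atoms), then M(A & B) = M(A) + M(B).
A, B = frozenset({1, 2}), frozenset({2, 3})
assert A | B == atoms
print(M(A & B), M(A) + M(B))                # both 0.7: m(~A) + m(~B)
```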
4.4 Bayesian Responses
The literature in Bayesian confirmation theory has long grappled with this problem of representing ignorance. That is especially so for prior probability distributions, where the presumption that ignorance must be representable is most pressing. Perhaps the most satisfactory response comes through the basic supposition of subjective Bayesians that the probabilities are subjective and may vary from person to person as long as the axioms of the probability calculus are respected. So the necessary deviation from the non-probabilistic ignorance distribution (I) in some agent’s prior probability distribution is discounted as an individual aberration not reflecting the true evidential situation. The price paid in adopting this
response, the injection of subjectivity and the necessity of every prior being aberrant, is too high a price for objective Bayesians, who are committed to there being one probability distribution appropriate for each circumstance.

However, objective Bayesian methods have not proven able to deliver true "ignorance priors," even though the term does appear over-optimistically in the objective Bayesian literature [Jaynes, 2003, Ch. 12]. One approach is to identify the ignorance priors by invariance properties. That meets only with limited success, since greater ignorance generates more invariances and, as we saw in Section 4.3 above, eventually there are so many invariances that no probability measure is admissible. An alternative approach is to seek ignorance priors in distributions of maximum entropy.27 Maximum entropy distributions do supply what are, in an intuitive sense, the most uniform distributions admissible. If, for example, we have an outcome space comprising n atomic propositions, without further constraints, the maximum entropy distribution is the one uniform distribution that assigns probability 1/n to each atomic proposition. However, if there is sufficient ignorance, there will be invariances under which the property of having attained maximum entropy will not be preserved. In the end, it is inevitable that these methods cannot deliver the ignorance distribution (I), for (I) is not a probability distribution. So the best that can be expected is that they will deliver a distribution that captures ignorance over one aspect of the problem, but not all. The tendency in the literature now is to replace the misleading terminology of "ignorance prior" by more neutral terms such as "noninformative priors," "reference priors" or, most clearly, "priors constructed by some formal rule" [Kass and Wasserman, 1996].

Another popular approach to representing ignorance within Bayesian confirmation theory is to allow that an agent's epistemic state is not given by any one probability measure, but by a set of them, possibly convex. (See for example [Levi, 1980, Ch. 9].) The deepest concern with this strategy is that it amounts to an attempt to simulate the failure of additivity that is associated with the representation of ignorance. Something like the complete ignorance state (I) can be simulated, for example, by taking the set of all probability measures over the same outcome space. The resulting structure is vastly more complicated than (I), the state it tries to simulate. It has become non-local in the sense that a single ignorance value is no longer attributed to an outcome. Each of the many probabilities assigned to some outcome must be interpreted in the context of the other values assigned to other propositions and in cognizance that there are many other distributions in the set.28

Finally, as pointed out in Norton [2007a; 2008], a set of probability measures necessarily falls short of simulating (I). For no set of additive measures can have the requisite invariance property of a complete ignorance state — invariance under the negation map. For a set of additive measures is transformed by the negation map into a set of dual additive measures. In informal terms, any set of additive measures on an outcome space preserves a directedness. For each measure in the set, as one proceeds from the assuredly false proposition to the assuredly true by taking disjunctions, the measures assigned are non-decreasing and must, at some point, increase strictly. Invariance under the negation map precludes such directedness.

27. For further objections to maximum entropy methods, see [Seidenfeld, 1979].

28. For a discussion of convex sets of probability measures and how they contain the Shafer-Dempster belief functions as a special case, see [Kyburg, 1987].
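The non-locality of the set-of-measures strategy is visible even in miniature. Over (a random sample standing in for) the set of all probability measures on a small outcome space, every contingent proposition takes values filling nearly the whole interval [0, 1]; only the collection as a whole, not any single number, carries the ignorance. A minimal sketch:

```python
# Sketch: simulating ignorance with a set of probability measures. Over
# a sample of all distributions on three atoms, each contingent
# proposition is assigned values spanning nearly [0, 1]: no one number
# is "the" degree of belief.

import random

def random_measure():                        # a random point of the simplex
    cuts = sorted(random.random() for _ in range(2))
    return (cuts[0], cuts[1] - cuts[0], 1 - cuts[1])

measures = [random_measure() for _ in range(10_000)]

for A in [(0,), (0, 1)]:                     # contingent propositions
    vals = [sum(p[i] for i in A) for p in measures]
    print(A, round(min(vals), 3), round(max(vals), 3))   # spans nearly [0, 1]
```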
4.5 Ignorance over a Countable Infinity of Outcomes
The difficulties of representing ignorance have been explored in the literature in some detail in the particular problem of identifying a state of ignorance over a countable infinity of outcomes. It has driven Bayesians to some extreme proposals, none of which appear able to handle the problem in its most intractable form.29 The traditional starting place — already "well-known" when de Finetti [1972, p. 86] outlined it — is to determine our beliefs concerning a natural number "chosen at random." Its more figurative version is "de Finetti's Lottery" [Bartha, 2004], in which a lottery ticket is picked at random from a countable infinity of tickets. If we write the prior probability for numbers 1, 2, 3, . . . as p1, p2, p3, . . . , we cannot reconcile two conditions. First, since we have no preference for any number over any other, we assign the same probability to each number:

pi = pk, for all i, k.

Second, the sum of all probabilities must be unity:

p1 + p2 + p3 + . . . = 1     (CA)
No set of values for pi can satisfy both conditions.30 De Finetti's own solution was to note that the condition (CA), "countable additivity" (as applied to this example), is logically stronger than finite additivity (A). The latter applies only to a finite set of outcomes — say, that the number chosen is a number in {1, 2, . . . , n}. It asserts that the probability of this finite set is the finite sum p1 + p2 + . . . + pn. Condition (CA) adds the requirement that this relation continues to hold in the limit of n → ∞.

29. The problem of ignorance over a countable infinity of outcomes is actually no worse than the corresponding problem with a continuum outcome space. The latter problem contains the former in that a continuum outcome space can be partitioned into a countable infinity of subsets. Perhaps the countable case is deemed more problematic since outcomes can still be counted, so it appears (incorrectly) that there will be a unique, natural probability measure recoverable from ratios of counts. In the continuum case, no such illusions are possible. The continuum case is more problematic in the sense that in it non-zero probabilities cannot be assigned to individual outcomes. Instead a probability density is used to assign probabilities to certain sets of outcomes. That presumes the outcome space has a topology that admits an additive measure. This device of a probability density becomes harder to use as the outcome space becomes larger. If the space consists of all real numbers, then the only uniform probability density is an improper density that cannot be normalized to unity.

30. Let p1 = p2 = p3 = . . . ; then p1 + p2 + p3 + . . . = 0 or ∞, according to whether the common value is zero or non-zero, both of which contradict (CA).
De Finetti asserted that the probability distribution representing our epistemic state in this problem need only be finitely additive, not countably additive. That allows us to set pi = 0 for all i, without forcing the probability of infinite sets of outcomes to be zero. So if odd = {1, 3, 5, . . .} and even = {2, 4, 6, . . .}, we can still set P(odd) = P(even) = 0.5.

Solving the problem by dropping countable additivity has proven to be a popular and well-understood solution. While proponents of the restriction to finite additivity are typically not frequentists (who identify probabilities with relative frequencies), the connection to frequentism is natural. The frequency of any particular natural number among all is zero and the frequency of even numbers is 1/2, in a naturally defined limit. Kadane and O'Hagan [1995] have mapped out which uniform, finitely additive probability distributions are possible over the natural numbers, noting how these are delimited by natural conditions such as agreement with limiting frequencies and invariance of probability under translation of a set of numbers. There are also variant forms of the proposal, such as the use of Popper functions and the notion of "relative probability" [Bartha and Johns, 2001; Bartha, 2004].

However, dropping countable additivity is not without disadvantages and enthusiasm for it is not universal. There are Dutch book arguments that favor countable additivity; and important limit theorems, including Bayesian convergence of opinion theorems, depend upon countable additivity. See [Williamson, 1999; Howson and Urbach, 2006, pp. 26-29; Kelly, 1996, Ch. 13].31 Other approaches explore the possibility of assigning infinitesimally small probabilities to what would otherwise be zero probability outcomes [McGee, 1994].

While all these solutions come at some cost, they are eventually unavailing. For they have not addressed the problem in its most acute form. They deal only with the case of ignorance over natural numbers, where this set of a countable infinity of outcomes has a natural order. If we presume that we know of no such natural order, then all these solutions fail, as has been shown in a paradox reported by Bartha [2004, §5.1] and in the work of Arntzenius [manuscript]. Imagine that we have a countable infinity of outcomes with no way to order them. If we have some labeling of the outcomes by natural numbers, that numbering is completely arbitrary.32 Let us pick some arbitrary labeling of the outcomes, 1, 2, 3, . . . , and seek an ignorance distribution over these labels. That ignorance distribution should be unaffected by any one-to-one relabeling of the outcomes; that is, the ignorance distribution is invariant under a permutation of the labels, for permutations are a symmetry of this system. Consider the outcomes odd = {1, 3, 5, . . .} and even = {2, 4, 6, . . .}.

31. To someone with an interest in physics, where probabilities are routinely summed over infinitely many outcomes, the restriction to finite additivity appears ruinous to ordinary physical theorizing. Countable additivity is hidden in many places. It is presumed whenever we normalize a probability distribution p(x) over some real-valued parameter x. For example

1 = ∫₀¹ p(x)dx = ∫₀^(1/2) p(x)dx + ∫_(1/2)^(3/4) p(x)dx + ∫_(3/4)^(7/8) p(x)dx + . . .

32. For a concrete example, imagine that an infinite space is partitioned into a countable infinity of geometrically identical cubes and that our cosmology tells us that a single hydrogen atom will appear in one of them without favoring any one of them. We arbitrarily label the cubes as 1, 2, 3, . . . .
There is a permutation that simply switches the labels of the two sets (1 ↔ 2, 3 ↔ 4, 5 ↔ 6, . . .), so that outcomes in each set are exchanged. Since our belief distribution is invariant under such a permutation, it follows that the permutation does not alter our belief and we must have the same belief in each outcome set odd and even. Now consider the four outcome sets

one = {1, 5, 9, 13, . . .} = {4i + 1 : i = 0, 1, 2, 3, . . .}
two = {2, 6, 10, 14, . . .} = {4i + 2 : i = 0, 1, 2, 3, . . .}
three = {3, 7, 11, 15, . . .} = {4i + 3 : i = 0, 1, 2, 3, . . .}
four = {4, 8, 12, 16, . . .} = {4i + 4 : i = 0, 1, 2, 3, . . .}

There is a permutation that switches labels of one and two; so we have equal belief in one and two. Proceeding pairwise through the sets, we find we must have equal belief in each of one, two, three and four. Now there is also a pairwise permutation that switches the labels of one with those of two ∪ three ∪ four, where

two ∪ three ∪ four = {2, 3, 4, 6, 7, 8, 10, 11, 12, 14, 15, 16, . . .}

It is just the obvious permutation read off the above sets: 1 ↔ 2, 5 ↔ 3, 9 ↔ 4, 13 ↔ 6, 17 ↔ 7, 21 ↔ 8, . . . So we now infer that we must have the same belief in one as in two ∪ three ∪ four. Combining, we find: we must have the same belief in two ∪ three ∪ four and in each of its disjunctive parts two, three and four. This requirement cannot be met by a probability distribution, for it contradicts (finite) additivity.33 It is however compatible with the ignorance distribution (I).

33. Another possibility is the improper probability distribution that assigns some small probability ε to each individual outcome and infinite probability to the total outcome space. This improper distribution is badly behaved under conditionalization. For example, P(2|even) = 0, not ε/2; and P(two|even) = ∞/∞ = undefined, not 1/2.

Finally, it is sometimes remarked that the very idea of uniform ignorance over a countable set is somehow illicit, for there is no mechanical contrivance that could select outcomes so that they have equal chances. No lottery commission could build a device that would implement the de Finetti lottery. For references to these concerns, see [Bartha, 2004, pp. 304-305], who correctly objects that the inevitable non-uniformity of probabilities of the lottery machine does not force non-uniformity of beliefs. What needs to be added is that the entire objection is based on a circularity. In it, the notion of a mechanical contrivance tacitly supposes a contrivance whose outcomes are governed by a probability distribution. So it amounts to saying that no machine whose outcomes are governed by a probability distribution can generate outcomes governed by a non-probabilistic distribution. Recent work in philosophy of physics has identified idealized physical mechanisms that produce indeterministic outcomes that are not governed by a probability distribution. An example is the "dome," described in [Norton, 2007, §8.3; forthcoming], where it is shown that the dome's outcomes are governed by a non-probabilistic distribution with the same structure as (I). There are many more examples of these sorts of indeterministic systems in the "supertask" literature. (For a survey, see [Laraudogoitia, 2004].) Many of these mechanisms conform with Newtonian mechanics, but depend on idealizations some find "unphysical" for their distinctive behavior.34 These processes could be the physical basis of an idealized contrivance that implements something like the de Finetti lottery.

34. For an analysis of the notion of "physical" in this context and the suggestion that we discount these concerns, see [Norton, 2008a].
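The label-swapping permutations in this argument can be written down explicitly. The sketch below (an illustration on an initial segment of the labels; the true permutations pair the full infinite sets) builds the map that exchanges one with two ∪ three ∪ four and checks the pairings quoted in the text:

```python
# Sketch: the permutation that swaps the set one = {4i+1} pairwise with
# two u three u four. It is a self-inverting bijection of the labels, so
# an ignorance distribution invariant under relabeling must give the two
# sets equal belief -- which no (finitely) additive probability can do.

N = 40                                        # initial segment for display
one  = [4*i + 1 for i in range(N)]
rest = sorted(4*i + j for i in range(N) for j in (2, 3, 4))

def involution(xs, ys):
    """Permutation swapping the i-th member of xs with the i-th of ys."""
    p = {}
    for a, b in zip(xs, ys):
        p[a], p[b] = b, a
    return p

pi = involution(one, rest)
print([(k, pi[k]) for k in (1, 5, 9, 13, 17, 21)])
# [(1, 2), (5, 3), (9, 4), (13, 6), (17, 7), (21, 8)]: exactly the
# pairing 1<->2, 5<->3, 9<->4, 13<->6, 17<->7, 21<->8 from the text.
```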
5 BAYESIAN DYNAMICS

5.1 The Property: Refute and Rescale
The properties investigated so far — framework and additivity — are essential to Bayesian confirmation theory. However we are still missing the essential part of the theory from which its name is derived. For so far, we have no means to relate probability measures conditioned on different propositions, so that we cannot yet relate the posterior probability P(H|E) of some hypothesis H conditioned on evidence E with its prior probability P(H) = P(H|B), for some background B. (Here and henceforth, for notational convenience, "P(A)" will be written as shorthand for "P(A|B)", for some assumed background B.) These means for relating prior and posterior probabilities are supplied by two properties. The first property is "narrowness" and asserts for any A and C that

P(A&C|C) = P(A|C)     (N)

The second property is "multiplication," which asserts for any A and C that

P(A&C) = P(A&C|C).P(C) = P(C&A|A).P(A)     (M)

These properties combined entail35 P(A|C) = P(A&C)/P(C) when P(C) is not zero. That this formula arises through compounding the two properties (N) and (M) is not generally reported. That compounding is important in the discussion that follows, since these two properties make distinct assertions in the context of confirmation theory and require separate analysis. These two properties (N) and (M) can also be combined to yield Bayes' theorem. For hypothesis H and evidence E:

P(H&E) = P(H&E|E).P(E) = P(H|E).P(E)
       = P(E&H) = P(E&H|H).P(H) = P(E|H).P(H)

so that when P(E) is not zero we have Bayes' theorem

P(H|E) = P(E|H).P(H)/P(E)     (B)

35. The combined formula is sometimes regarded as the definition of conditional probability in terms of unconditional probability. That view is not taken here since P(A) is not unconditional, but shorthand for P(A|B). Hájek [2003] has objected that this candidate definition fails to yield an everywhere serviceable notion of conditional probability.
It tells us how learning evidence E should lead us to update our beliefs, as expressed in the transition from the prior P(H) to the posterior probability P(H|E). Bayes' theorem embodies a very simple model of belief dynamics whose two steps essentially correspond to the two properties (N) and (M) above.

Refute. When evidence E is learned, those parts of the hypothesis H logically incompatible with the evidence are discarded as irrelevant to the bearing of evidence. That is, the hypothesis H has two parts in relation to evidence E: H = (H&E) ∨ (H&∼E). The second part (H&∼E) is that part which is refuted by E. The first part (H&E) is the part of H that entails E. H can only accrue support through this part, since we have from (N) that P(H|E) = P(H&E|E).

Rescale. The import of evidence E on competing hypotheses H, H′, . . . is then simply the prior probability P(H&E), P(H′&E), . . . of those parts of the hypotheses logically compatible with the evidence, linearly rescaled so as to preserve normalization to unity; for, we have from (B) and (N) that

P(H|E)/P(H′|E) = P(H&E|E)/P(H′&E|E) = [P(E|H&E).P(H&E)]/[P(E|H′&E).P(H′&E)] = P(H&E)/P(H′&E)

since P(E|H&E) = 1 = P(E|H′&E).

A figurative model of this dynamics underscores its very great simplicity. In his "muddy Venn diagram," van Fraassen [1990, pp. 161-62] pictures the total outcome space as a surface, such as a table top, and the degrees of belief assigned to outcome sets are represented by volumes of mud piled over the areas corresponding to each outcome. Conditionalization on evidence E occurs in two steps, as shown in Figure 1. First ("refute"), all the mud piled outside E is carefully swept away. Second ("rescale"), that swept-away mud is redeposited on E in such a way that the proportions in the contours over E are retained.36

It is informally evident that this qualitative notion of refute and rescale dynamics captures the essence of Bayesian dynamics. That informal idea has been made more precise in [Norton, 2007, §4], where the dependencies of refute and rescale dynamics are used to generate an axiom system for degrees, which turn out to be ordinary probabilities up to monotonic rescalings.

36. For example, if the mud is twice as high at one spot in relation to another prior to the redepositing of the mud, it will be piled twice as high after, as well.
Figure 1. Muddy Venn diagram illustrates rescale and refute dynamics
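The two steps can also be written directly in code. A minimal sketch of conditionalization as refute-and-rescale on a discrete outcome space (the prior values are invented for illustration):

```python
# Sketch: Bayesian conditionalization as refute-and-rescale on a discrete
# outcome space. Outcomes carry prior "mud"; learning E sweeps away the
# mud outside E (refute) and renormalizes what remains (rescale),
# preserving the proportions within E.

prior = {"w1": 0.2, "w2": 0.3, "w3": 0.4, "w4": 0.1}   # illustrative priors
E = {"w2", "w3"}                                        # evidence learned

def conditionalize(p, E):
    kept = {w: v for w, v in p.items() if w in E}       # refute: discard ~E
    total = sum(kept.values())
    return {w: v / total for w, v in kept.items()}      # rescale: renormalize

print(conditionalize(prior, E))
# {'w2': 0.428..., 'w3': 0.571...}: the 3:4 ratio within E is retained.
```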
5.2 Non-Bayesian Shifts
Under refute and rescale dynamics, the import of evidence is always to modify prior beliefs. Earman [1992, pp. 195-198] and Brown [1996] state the obvious difficulty. The dynamics cannot apply if there is no prior belief since the hypothesis or outcome in question is not in the outcome space. Just such a circumstance occurs at decisive moments in the history of science. At moments of rapid change — best known through Thomas Kuhn's notion of a scientific revolution — the inconceivable becomes conceivable and, we are told, that evidence compels us to believe it.

When Einstein proposed his special theory of relativity in 1905, he urged that there is no absolute fact as to whether spatially separated events are simultaneous; the simultaneity of events depends upon the inertial reference frame from which they are assessed. That notion was regarded as bizarre by Einstein's critics, many of whom continued to refuse to take it seriously long after 1905. Another example is Bohr's theory of the atom of 1913. That theory seemed to require commitment to an inconsistency: in different processes, accelerated electrons bound in an atom would be supposed to radiate, as classical electrodynamics required, or would be supposed not to radiate, in contradiction with classical electrodynamics, according to which supposition was expedient for the recovery of the results Bohr needed.

Closely connected to the problem of outcomes not in the space are those to which it is natural to assign zero probability. A logical inconsistency, such as Bohr's theory, is a natural candidate. It follows immediately from Bayes' theorem that, once we have assigned a zero prior probability to any hypothesis, conditionalization cannot change its probability from zero. The only recovery is to reassign a non-zero probability to the hypothesis by a process that contradicts Bayes' theorem.

We might discount as rarities these inversions of what is conceivable and the oddity of accepting inconsistencies as established law. However, even if they are rarities, they are important rarities that reappear throughout the history of our science and cannot be neglected by a comprehensive account of scientific rationality.

The natural Bayesian rejoinder is to protest that too much is being asked of it. The formation of new outcome spaces belongs to Reichenbach's context of
discovery. Bayesian confirmation theory, indeed any logic at all, cannot be responsible both for logical relations among propositions and also for the discovery methods used to create new outcome spaces.
5.3 Is Bayes' Theorem Sensitive Enough to the Connections between Evidence and Theory?
The simplicity of refute and rescale dynamics invites the next challenge: the dynamics is too simple and too insensitive to the ways that evidence and theory can relate.

5.3.1 Problems Arising from "Refute. . . ."
The first "refute" step of Bayesian conditionalization depends essentially on the idea that the disjunctive part of H = (H&E) ∨ (H&∼E) that extends beyond the total evidence E, that is H&∼E, does not affect the bearing of evidence E on H. That is, "narrowness" directly asserts that P(H|E) = P(H&E|E). While this idea is so familiar as generally to pass without comment, it does represent a blindness in Bayesian confirmation theory. As an illustration of it, imagine that we seek to identify some animal. Our evidence is that it is a bird. What support does that evidence lend to the possibility that the animal is a canary or, alternatively, a canary or a whale? Narrowness asserts that it lends the same support:

P(canary or whale | bird) = P(canary | bird)

In some scenarios, this can make sense. As we check instances of birds, we will find the frequency of "canary or whale" to arise exactly as often as "canary." If these frequencies are all that matters, then we would say the two outcomes are equally supported. However, few would halt there. We would discount the "whale" disjunct as a nuisance embellishment that should be dismissed as a distraction precisely because it is deductively refuted by the evidence "bird." The frequency count just mentioned is blind to its nuisance character in so far as it assigns the same frequency of success to "canary" as to "canary or whale."

It is possible to devise a logic that is sensitive to this problem and punishes an hypothesis in the degree that it extends beyond the evidence. An example37 is the "specific conditioning" logic defined in [Norton, forthcoming b, Section 10.2]. In a formulation compatible with developments here, the logic amounts to a new definition for the conditional probability P(H|E). It starts with an additive measure P(.) on the background and defines the alternative

PSC(H|E) = P(H&E)²/(P(E).P(H)) = [P(H&E)/P(E)].[P(H&E)/P(H)]     (SC)

This alternative rule of conditionalization rewards hypotheses H with more support the closer they come to the total evidence E = (E and background) — hence the term "specific conditioning." PSC(H|E) = 1 only when H = E, excepting measure zero differences. H can differ from E in two ways and the logic of (SC) penalizes it for both. In the first way, H can fail to exhaust E in measure. Then the first factor of (SC), that is, P(H&E)/P(E), is less than one (which alone would correspond to the usual case of Bayesian conditionalization). In the second way, H can extend beyond E in measure. Then the second factor of (SC), that is, P(H&E)/P(H), is less than one. Deviations in both ways are penalized equally and it turns out that this is expressed in a striking symmetry in PSC. That is, PSC(H|E) = PSC(E|H).

Returning to the bird example, with plausible values for PSC(.), we will have

PSC(canary or whale | bird) < PSC(canary | bird)

Indeed we will have

PSC(canary or whale | canary) < PSC(canary | canary) = 1

That is, even though "canary" deductively entails "canary or whale," the latter is not accorded full support from "canary." The logic of (SC) judges "canary or whale" to be supported less specifically by the evidence "canary." In the case of Bayesian confirmation theory, evidence accords unit support to any of its deductive consequences, no matter how much they are weakened logically. PSC reduces the support accorded to these deductive consequences according to how much they are weakened logically.

37. Specific conditioning resembles some of the measures of incremental support that have been investigated in the Bayesian literature, such as the ratio measure r(H, E) = P(H|E)/P(H) = P(H&E)/(P(E).P(H)). (See [Eells and Fitelson, 2002].) The two differ essentially. The incremental measures are functions of three arguments, H, E and B, that represent the incremental support accorded H by evidence E alone with respect to a tacit background B. PSC(H|E) is a function of two arguments, H and E&B, that measures the total support accorded to H by evidence E conjoined with the background B, that is, by the total evidence E&B.
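A hypothetical numerical rendering of the bird example makes the contrast vivid. The measure over animal kinds below is invented purely for illustration; only the qualitative comparison matters:

```python
# Sketch: "specific conditioning" (SC) versus ordinary conditionalization
# on the bird example. The measure over kinds is illustrative only.

P = {"canary": 0.1, "other-bird": 0.4, "whale": 0.1, "other-mammal": 0.4}

def prob(A):
    return sum(P[w] for w in A)

def cond(H, E):                      # ordinary P(H|E)
    return prob(H & E) / prob(E)

def sc(H, E):                        # PSC(H|E) = P(H&E)^2 / (P(E) P(H))
    return prob(H & E) ** 2 / (prob(E) * prob(H))

bird, canary = {"canary", "other-bird"}, {"canary"}
canary_or_whale = {"canary", "whale"}

print(cond(canary, bird), cond(canary_or_whale, bird))   # 0.2, 0.2: equal
print(sc(canary, bird), sc(canary_or_whale, bird))       # 0.2, 0.1: SC
# penalizes the "whale" embellishment that extends beyond the evidence.
```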
5.3.2 Problems Arising from ". . . Rescale"
In the special case in which the hypothesis H deductively entails the evidence E, the first "refute" step of Bayesian conditionalization is not invoked and the process reduces to rescaling only, by means of the property "multiplication" (M). This special case provides a means of isolating the second step and plumbing its problems. Consider two hypotheses H and H′, both of which entail the evidence E. In that case, it follows from Bayes' theorem that the ratio of the posterior probabilities P(H|E)/P(H′|E) is the same as the ratio of the priors. It now follows that the incremental confirmation, as measured by the ratio of posterior and prior P(H|E)/P(H), is the same in both cases. That is, when two hypotheses both entail the same true evidence, they get the same confirmatory boost. In particular, if H is more probable than, less probable than or just as probable as H′ prior to conditionalization, it will remain so afterwards.
This failure to separate H and H′ evidentially, according to critics, is to overlook that their entailments of the evidence E can differ in more subtle qualities that have epistemic import. For example, recall efforts in the late 19th century to detect the earth's motion through the luminiferous ether. These experiments yielded a null result (E). We may entertain two hypotheses:

H: There is no ether state of rest against which the earth can move.

H′: There is an ether state of rest but we happen to be motionless with respect to it each time the experiments are performed.

Each of H and H′ entails the evidence E, but it seems wrong that both should receive the same support from E. H succeeds, as it were, by honest toil; but H′ by theft.

There are many ways that this metaphor of honest toil and theft is explicated. One way notes that H succeeds because it explains why the experiments yielded null results, whereas H′ does not. Considerations like this lead Achinstein [2001, Ch. 7] to argue that it is not sufficient to count as evidence that some datum incrementally increases the probability of an hypothesis or that the probability of the hypothesis conditioned on the datum is high.38 In addition there must be an explanatory connection.

38. See [Achinstein, 2001, Ch. 7 and passim] for numerous examples. For example, learning that most lottery tickets were unsold may increase greatly my belief that mine is the winning ticket, even though the absolute probability may remain small. See also [Laudan, 1997].

Other approaches are variants of this idea that some successful entailments are virtuous and others not. The hypothesis H makes the prediction that all ether current experiments will fail and that prediction is verified by the evidence. Since H′ supposes that it is by happenstance that we were at rest in the ether just at the moment of the experiments, H′ simply accommodates the evidence in the sense that it has been adjusted to accommodate experimental reports at hand. Predictions, for example, are virtuous and their success merits an increase in belief; accommodations are not virtuous and are epistemically inert. See [Horwich, 1982, Ch. 5] for discussion and an argument that we should not differentially reward prediction and accommodation.

Another way of expressing these concerns is to note that H′ (but not H) was cooked up "ad hoc" specifically to match the evidence. A classic example is the creationist hypothesis that the world was created in 4004 BC, complete with a fossil record that perfectly simulates a more ancient, evolutionary past. It is an evidential truism that such ad hoc hypotheses gain no support from the observations they accommodate. For an account of how ad hoc hypotheses can be treated in a Bayesian context, see [Howson and Urbach, 2006, pp. 121-126], who also suggest that sometimes there are "good" ad hoc hypotheses that deserve support.

The literature on these evidential virtues in Bayesian confirmation theory is large and convoluted. In general, however, there are two broad strategies that Bayesian confirmation theory can employ when faced with a virtuous and non-virtuous pair of hypotheses. The first is to expand the outcome space so that there
are more resources available to pick the hypotheses apart.39 That risks merely postponing the problem, since it may now return for another pair of hypotheses in the larger space. The second is to use prior probabilities to reward virtuous hypotheses and punish non-virtuous ones. That is, since the ratio of the posterior probabilities P(H|E)/P(H′|E) equals the ratio of the priors P(H)/P(H′), we may reward the explanatory, predictive or non-ad hoc H by making the ratio P(H)/P(H′) very large, so that there is a corresponding advantage for H in the posterior probabilities. As observed in [Norton, 2007, §5.3.3], this use of prior probabilities requires a curious prescience: we are to penalize the prior probability of an ad hoc hypothesis in advance in just the right degree, so that the punishment perfectly cancels the as yet unknown confirmatory reward that will be accorded to the hypothesis by Bayes' theorem. Further, prior probabilities for hypotheses or theories are assigned once, globally, whereas virtues are manifested locally and may differ from domain to domain. So we may find a theory explanatory in one domain and want to reward it with a high prior probability; but may find it explanatorily deficient in another, and may want to punish it with a low prior probability.

39. For example, Maher treats prediction and accommodation by supposing that hypotheses are generated by methods that may or may not have certain items of evidence available to them when they generate the hypotheses. He then shifts assessment to the credibility of the methods. See [Achinstein, 2001, pp. 215-221] for discussion.
5.4 The "Likelihood Theory of Evidence" and "Likelihoodism"
The above discussion has dealt with the case of hypotheses that entail the evidence. Matters are slightly more complicated when the hypothesis only makes the evidence more or less probable. In that case, Bayes' theorem allows us to compare the import of evidence E for two hypotheses H and H′ as

P(H|E)/P(H′|E) = [P(E|H)/P(E|H′)].[P(H)/P(H′)]     (B′)
It follows that the relative import of the evidence E is fully determined by the two likelihoods, P(E|H) and P(E|H′). For the ratio of the prior probabilities P(H)/P(H′) reflects our comparative belief in H and H′ prior to any consideration of evidence E; the ratio of the posteriors P(H|E)/P(H′|E) reflects our comparative belief in H and H′ after the import of evidence E has been accommodated. The ratio of the likelihoods P(E|H)/P(E|H′) is what takes us from the ratio of the priors to the ratio of the posteriors. Therefore, it expresses the relative import of evidence E on these two hypotheses. That the likelihoods capture all needed information for evaluating the bearing of evidence is the core notion of a "likelihood theory of evidence." (In the special case in which the hypotheses entail the evidence, P(E|H) = P(E|H′) = 1. Then the ratio of the posteriors equals the ratio of the priors and E cannot discriminate evidentially between H and H′.)

Since so many of the problems of Bayesian confirmation theory focus on prior probabilities, an attractive modification to the theory is a view that excises prior
probabilities but leaves the rest intact. The view, popularly known as "likelihoodism," has been advocated by Edwards [1972] and Royall [1997]. As we have just seen, in Bayes' theorem (B′), the evidence E favors the hypothesis H over H′ just in the ratio of their likelihoods, P(E|H)/P(E|H′). The proposal is that this consequence of Bayes' theorem be extracted and elevated to an independent principle, the "law of likelihood." The judgment of evidential import is always comparative — evidence favors this hypothesis more or less than that — and the troublesome prior probabilities never need enter.

The principal difficulty for likelihoodism is that it faces all the problems of the Bayesian refute and rescale dynamics recounted here, but without the resources of prior probabilities to ameliorate them. In particular, likelihoodists must say that evidence entailed by two hypotheses is equally favorable to both hypotheses, no matter what their virtues or vices. For likelihoodists have lost the Bayesian mechanism of using prior probabilities to reward simpler hypotheses or ones with anticipated predictive power and to penalize ad hoc or contrived hypotheses. Royall [1997, §1.7] considers and offers likelihoodist rejoinders to such concerns.
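The arithmetic of (B′) is a one-liner, which is part of likelihoodism's appeal. A minimal sketch with invented numbers, showing the likelihood ratio as the sole factor carrying the prior ratio to the posterior ratio:

```python
# Sketch: the likelihood ratio as the factor that converts the ratio of
# priors into the ratio of posteriors, per (B'). Numbers illustrative.

P_E_given_H, P_E_given_H2 = 0.8, 0.2     # likelihoods of the evidence E
prior_H, prior_H2 = 0.3, 0.7             # prior probabilities

likelihood_ratio = P_E_given_H / P_E_given_H2
posterior_ratio = likelihood_ratio * (prior_H / prior_H2)

print(likelihood_ratio)   # 4.0: for the likelihoodist, the import of E
print(posterior_ratio)    # 4.0 * (3/7): the Bayesian adds the priors
```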
5.5 Model Selection, Prediction and Simplicity
While the notion that the likelihoods capture the incremental import of evidence has enjoyed notable successes, its overall history has been troubled. There are recalcitrant cases in which the likelihoods alone are clearly too coarse a measure of the import of evidence. Likelihood rewards only accuracy in fitting the data at hand. It has difficulty accommodating our common preference for simpler hypotheses that may be less accurate. In rewarding fit to the data at hand, likelihood may be a weaker guide to the greater truth sought that extends beyond these data; that is, likelihood tends to favor accommodation to present data as opposed to predictive success with future data.

The familiar and long-standing illustration of these difficulties comes in the problem of curve fitting. Let us say that we wish to find the relationship between variables x and y:40

⟨x1, y1⟩ = ⟨0.7, 1.0⟩, ⟨x2, y2⟩ = ⟨1.5, 1.8⟩, ⟨x3, y3⟩ = ⟨2.1, 2.0⟩, . . . ,     (DATA)

which are shown on the graph of Figure 2 below. Our presumption is that these data were generated by the relation

yi = f(xi) + errori

where errori is a random error term affecting the i-th pair of x-y values. (It is common to assume that these error terms are normally distributed and independent of one another.) The goal of our analysis is identification of the function f(x). The standard method is to find that function that generates a curve of best fit to the data. For data such as in Figure 2, we would normally seek a straight line. That is, we would assume that the unknown function is linear, so that the data was generated by

yi = A + Bxi + errori     (LIN)

That straight line turns out to be

yi = −0.332 + 0.997xi + errori     (LINbest)

The straight line LINbest is the best fit to the data in the sense that it is the straight line drawn from LIN that makes the DATA most probable. That is, its values of A and B maximize the likelihood P(DATA | LINbest); and A = −0.332 and B = 0.997 are the "maximum likelihood estimators" of A and B.

40. The full data set is ⟨0.7, 1.0⟩, ⟨1.5, 1.8⟩, ⟨2.1, 2.0⟩, ⟨2.3, 0.6⟩, ⟨2.6, 1.0⟩, ⟨3.8, 2.1⟩, ⟨4.5, 3.1⟩, ⟨4.7, 6.0⟩, ⟨5.6, 6.9⟩, ⟨5.6, 7.7⟩, ⟨5.8, 4.9⟩, ⟨6.2, 4.4⟩, ⟨7.1, 7.7⟩, ⟨7.6, 6.7⟩, ⟨8.8, 10.1⟩, ⟨8.9, 8.2⟩, ⟨9.1, 8.1⟩, ⟨9.3, 7.4⟩.
Figure 2. Linear and cubic curves of best fit

The complication is that we can find other functions f(x) that fit the data more closely and make it even more probable. Consider, for example, cubic functions

yi = A + Bxi + Cxi² + Dxi³ + errori     (CUB)

The particular cubic plotted in Figure 2 against the background of the same data,

yi = 1.952 − 1.377xi + 0.581xi² − 0.0389xi³ + errori     (CUBbest)

maximizes the likelihood P(DATA | CUBbest). As one can see from comparing the two curves visually, the cubic CUBbest fits the data more closely than the linear LINbest and it turns out that

P(DATA | CUBbest) > P(DATA | LINbest)
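This inequality can be reproduced with the data of footnote 40. The sketch below fits both models by least squares, which yields the maximum likelihood fit under the normal, independent error assumption; the fitted coefficients should come out close to those quoted for LINbest and CUBbest:

```python
# Sketch: fit LIN and CUB to the footnote-40 data by least squares and
# compare maximum log-likelihoods under normal, independent errors. The
# cubic, with more freedom to track noise, attains the higher likelihood.

import math
import numpy as np

data = [(0.7, 1.0), (1.5, 1.8), (2.1, 2.0), (2.3, 0.6), (2.6, 1.0),
        (3.8, 2.1), (4.5, 3.1), (4.7, 6.0), (5.6, 6.9), (5.6, 7.7),
        (5.8, 4.9), (6.2, 4.4), (7.1, 7.7), (7.6, 6.7), (8.8, 10.1),
        (8.9, 8.2), (9.1, 8.1), (9.3, 7.4)]
x = np.array([p[0] for p in data])
y = np.array([p[1] for p in data])

def log_likelihood(degree):
    coeffs = np.polyfit(x, y, degree)      # least squares = ML fit
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)           # ML estimate of error variance
    n = len(x)
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

print(log_likelihood(1))   # LIN_best
print(log_likelihood(3))   # CUB_best: strictly larger
```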
Therefore, according to the likelihood theory of evidence, we should conclude that the data better supports the cubic CUBbest over the linear LINbest . It is evident from a cursory scan of the data that this is most likely a mistake. Whatever trend f (x) may be embodied within the data, it is heavily confounded by noise from the error term. While the data definitely gives us warrant to believe that f (x) is an increasing function of x, we have no warrant for the specifics of the cubic CUBbest .41 We should be much more modest in what we infer from data this noisy. The better fit of the cubic CUBbest has been achieved by the greater ability of a cubic to conform to random fluctuations from the trend induced by noise; that is, the curve is “overfitted.” This overfitting is most evident around x = 1, where the curve reverses direction; and again around x = 9, where the curve also reverses direction. Given the sparsity and scattering of the data, cubic CUBbest is merely tracking noise at these places, whereas the linear LINbest is responding to a real increase with x in the trend. The informal discussion of the last paragraph captures the practical thinking of curve fitting. It is grounded tacitly in the considerations of simplicity and prediction at issue here. We know that we can always find a curve that fits the data better by looking to more complicated functional forms, such as higher order polynomials. However, at some stage, we judge that we should prefer the simpler curve over the more complicated one that fits the data better. The reason is that our interests extend beyond the narrow problem of finding a curve that merely fits this particular data set well. Our data is only a sample of many more values of x and y; and we are interested in finding a function f (x) that will also fit these other as yet unseen values. That is, our goal is to find a function f (x) that will support prediction. We expect that we can best achieve that goal by forgoing some accuracy of fit with the present data of an overfitted curve in favor of a simpler functional form for f (x) that will enable future data to be fitted better. For an overfitted curve responds to random noise whose specific patterns of deviations will probably not reappear in future data; however a curve drawn from the simpler model responds more to the real relation that drives present and, we suppose, future data as well. The challenge to a Bayesian analysis is to find a way of capturing these last informal thoughts in a more precise analysis. The natural way to do this is to break up the inference problem into two parts.42 The first part requires us to consider only models. These are sets of hypotheses indexed by parameters. For example, LIN above is the set of linear functions with parameters A and B taking all real values. CUB above is the set of all cubic functions with the parameters A, B, C and D taking all real values. In the first part we decide which model is 41 Perhaps it is unfair to introduce a more elevated perspective that affirms this judgment against the cubic CUBbest . The data plotted was generated artificially from the model yi = xi + errori , so the linear LINbest has actually done a very good job of identifying the trend y = x. 42 Traditional non-Bayesian statistical methodology implements this two part procedure by testing whether deviations in the coefficients B and C of CUB from zero are statistically significant. 
If they are not, the null hypothesis C = D = 0 is accepted and estimation proceeds with the simple model LIN.
appropriate to the data. For DATA above, we presume that would turn out to be LIN and not CUB. In the second part of the inference problem, we then find the curve that fits the data best within that selected model.

Implementing this program within Bayesianism runs into several difficulties. The functions of LIN are a subset of those of CUB, for CUB reverts to LIN if we set C = D = 0. So if we compute the posterior probabilities of the models on the DATA, we will always find P(LIN | DATA) ≤ P(CUB | DATA), from which it follows that the model LIN can never be more probable on the DATA than the model CUB. So it seems that we cannot have a straightforward Bayesian justification for preferring LIN over CUB.

Slightly less than straightforward grounds can be found. We compare not the posterior probabilities of the models, but the boosts in probability each model sustains upon conditionalizing on the data. That is, in this modified approach, we would prefer LIN over CUB if the ratio P(LIN | DATA)/P(LIN) exceeds P(CUB | DATA)/P(CUB). The ratio of these two ratios is itself called the "Bayes factor." This modified criterion of the Bayes factor has some plausibility, since it conforms to the spirit of the likelihood theory in that we consider changes in probability under conditionalization, not absolute values of the posterior probability. However the modified criterion rapidly runs into difficulties with prior probabilities. To compute P(LIN | DATA), for example, we would use Bayes' theorem, in which the likelihood P(DATA | LIN) appears. This likelihood is really a compound quantity. It is a summation over the infinitely many curves that appear in the model LIN, corresponding to all possible values of A and B. That is,

P(DATA | LIN) = ∫_{all A,B} p(DATA | A, B) p(A, B) dA dB
The problematic term is p(A, B), which is the prior probability density for the parameters A and B given that LIN is the correct model. Since it expresses our distribution of belief over different values of A and B prior to the evidence, p(A, B) should favor no particular values of A or B. However it is a familiar problem that no probability distribution can do this. If p(A, B) is any non-zero constant, then it cannot be normalized to unity when we integrate over all possible values of A and B. While no unqualified solution to these problems has emerged, it has proven possible to show that, for large data sets, the Bayes factor depends only weakly on the choice of the prior probability density. In that case, the model favored by the Bayes factor is the one that maximizes what is known as the Bayesian Information Criterion (BIC):

BIC = log L_best − (k/2) log n
where the maximum likelihood L_best is the likelihood of the data conditionalized on the best fitting curve in the model, k is the number of parameters in the model and n is the size of the data set. (For LIN, k is 2; for CUB, k is 4.) For further discussion, see [Wasserman, 2000]. In maximizing BIC, we trade off accuracy of fit (expressed in the maximized likelihood) against the simplicity of the model (expressed in the number of parameters k). This was not the goal in constructing BIC; it was generated from an analysis that seeks the most probable model on the evidence. An analogous criterion can be generated by directly seeking the model with the greatest predictive power, as opposed to the one that is most probable. This criterion, the Akaike Information Criterion (AIC), advocated by Forster and Sober [1994], is given by

AIC = log L_best − k

The rationale is that overtly seeking predictive power is a surer way to get to a truth that extends beyond the particular data at hand; seeking the most probable model risks favoring the vagaries of the particular data at hand. The connection to prediction is achieved through the notion of "cross-validation." While we cannot now test the predictive powers of a model against presently unknown data, we can simulate such a test by dividing the data at hand into two parts. We use the first part to generate a best fitting hypothesis in the model and then we check how well the resulting hypothesis fits the remaining data. In a large set of N data, we leave out one datum and use the remaining N − 1 to generate hypotheses from the model. We then score the error in fitting the datum left out. If we allow each datum successively to be left out in this procedure and average the resulting scores, we recover an estimate of the predictive powers of the model. That estimate turns out to approach the AIC statistic for large N [Forster, 2002, p. S128; Browne, 2000]. The presumption is that predictive powers manifested within a known data set in this computation will persist beyond the confines of that data set.

It remains an open question whether likelihoods may be used as the basis of assessment of models when we need to find an appropriate accommodation of accuracy, simplicity and predictive power. That all the proposals above fail has been suggested recently by Forster [2006; 2007]. He has devised examples of pairs of models with the same number of parameters k that also deliver the same likelihood and maximum likelihood for a specified set of data. As a result, considerations of likelihood alone, or the more refined BIC or AIC, are unable to discriminate between the models. However it is evident from informal considerations that one model is predictively superior to the other. Forster's simplest example43 draws on a set of data with three points ⟨x_i, y_i⟩:

⟨1, 1⟩, ⟨2, 2⟩, ⟨3, 3⟩    (DATA′)
43 This example was presented by Malcolm Forster in his talk "Is Scientific Reasoning Really that Simple?" at the conference "Confirmation, Induction and Science," London School of Economics, March 8-10, 2007, on March 8.
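Before turning to the details of Forster's example, the two criteria and the leave-one-out estimate just described can be put into code. The sketch below is an illustrative reconstruction under the same Gaussian assumptions as the earlier sketch; with a known, fixed error variance, polyfit's least-squares fit coincides with the maximum likelihood fit.

```python
import numpy as np

def log_likelihood(y, y_hat, sigma=2.0):
    resid = y - y_hat
    return float(np.sum(-0.5 * (resid / sigma) ** 2
                        - np.log(sigma * np.sqrt(2 * np.pi))))

def bic(x, y, degree, sigma=2.0):
    # BIC = log L_best - (k/2) log n, with k parameters and n data points
    k, n = degree + 1, len(x)
    coeffs = np.polyfit(x, y, degree)
    return log_likelihood(y, np.polyval(coeffs, x), sigma) - 0.5 * k * np.log(n)

def aic(x, y, degree, sigma=2.0):
    # AIC = log L_best - k, in the form used in the text
    k = degree + 1
    coeffs = np.polyfit(x, y, degree)
    return log_likelihood(y, np.polyval(coeffs, x), sigma) - k

def loocv_score(x, y, degree, sigma=2.0):
    # leave-one-out cross-validation: fit on n-1 points, then add up the
    # predictive log-likelihood of each point left out in turn
    total = 0.0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coeffs = np.polyfit(x[mask], y[mask], degree)
        total += log_likelihood(y[i:i + 1], np.polyval(coeffs, x[i:i + 1]), sigma)
    return total

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = x + rng.normal(0.0, 2.0, size=x.size)
for d in (1, 3):  # LIN (k = 2) versus CUB (k = 4)
    print(d, bic(x, y, d), aic(x, y, d), loocv_score(x, y, d))
```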
The first model, H_a: y_i = a·x_i, comprises four functions

H_{1/2}: y_i = (1/2)x_i
H_1: y_i = x_i
H_2: y_i = 2x_i
H_3: y_i = 3x_i

corresponding to the parameter values a = 1/2, 1, 2, 3. The relation of this model to DATA′ is shown in Figure 3.
Figure 3. Models H and K

The second model, K_a: y_i = x_i/a + (a − 1)^2/a, comprises four functions

K_1: y_i = x_i
K_2: y_i = x_i/2 + 1/2
K_3: y_i = x_i/3 + 4/3
K_4: y_i = x_i/4 + 9/4
corresponding to parameter values a = 1, 2, 3, 4. The relation of this model to DATA′ is shown in Figure 3.

Which model does DATA′ favor? The computation of the likelihoods of the individual functions is easy, since they are all zero or one. The only non-zero likelihoods are

P(DATA′ | H_1) = 1    P(DATA′ | K_1) = 1

We assume that the prior probabilities of the functions within each model are the same. That is,44

P(H_a | H) = 1/4 for a = 1/2, 1, 2, 3
P(K_a | K) = 1/4 for a = 1, 2, 3, 4

Thus the likelihood associated with each model is the same:

P(DATA′ | H) = P(DATA′ | H_1) × P(H_1 | H) = 1/4
P(DATA′ | K) = P(DATA′ | K_1) × P(K_1 | K) = 1/4

44 In these formulae, read H = H_{1/2} ∨ H_1 ∨ H_2 ∨ H_3 and K = K_1 ∨ K_2 ∨ K_3 ∨ K_4.
Hence, we cannot use DATA′ to discriminate between the two models H and K by means of their maximum likelihoods or the likelihoods of the models. Moreover, we cannot discriminate between them by means of BIC or AIC; for both models agree on the maximum likelihood and the number of parameters, k = 1. While all the likelihood-based instruments fail to discriminate between the models, informal considerations favor the model H. For it requires only one datum from DATA′, such as ⟨1, 1⟩, to fix the function y = x in H that fits the remaining data. The remaining two data points act as tests of the model; or, we may view them as successful predictions. However it takes two data points from DATA′ to fix the function in K that fits the remaining data. The datum ⟨1, 1⟩, for example, is compatible with both K_1 and K_2. A second datum is needed to decide between them. Hence only one datum remains to test the fitted function; or the model needs more data before it can make successful predictions. As a result we informally judge that the model H is better tested and predictively stronger, and thus better warranted by the evidence. (For discussion of this aspect of curve fitting, see [Glymour, 1980, Ch. VIII].)45 For further discussion of Bayesian and likelihood-based analyses of this problem, see [Howson and Urbach, 2006, pp. 288-96] and [Forster and Sober, 2004] (which includes commentary by Michael Kruse and Robert J. Boik).
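The deadlock, and the informal consideration that breaks it, can be verified by direct enumeration. The following sketch is an illustration, not part of the original text: it confirms that exactly one member of each model fits DATA′ perfectly (so all the likelihood-based criteria tie), and then counts how many data points are needed before a unique fitting member of each model is singled out.

```python
DATA = [(1, 1), (2, 2), (3, 3)]

# the four members of each model, as functions of x
H = {a: (lambda x, a=a: a * x) for a in (0.5, 1, 2, 3)}
K = {a: (lambda x, a=a: x / a + (a - 1) ** 2 / a) for a in (1, 2, 3, 4)}

def fits(f, points):
    return all(f(x) == y for x, y in points)

for name, model in (("H", H), ("K", K)):
    # exactly one member of each model fits all of DATA', so the maximum
    # likelihood (1) and the model likelihood (1/4) are the same for both
    print(name, "fits all of DATA':", [a for a, f in model.items() if fits(f, DATA)])
    # how many data are needed before only one member survives?
    for n in (1, 2):
        survivors = [a for a, f in model.items() if fits(f, DATA[:n])]
        print(name, "after", n, "datum/data:", survivors)
# H is pinned down by one datum; K still has two surviving members after one
```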
5.6 A Neutral Starting Point: The Second Problem of the Priors
Bayesian confirmation theory depicts learning from evidence as the successive conditionalization of a prior probability distribution on evidence, so that the resulting posterior probability distributions increasingly reflect just the evidence learned. As we trace back along this chain of conditionalization, we expect to come to epistemically more neutral probability distributions, reflecting our lesser knowledge earlier in the process. The problem of the priors, in both forms developed here, is that the structure of Bayesian confirmation theory essentially precludes neutral probability distributions. In the first problem of the priors (Section 4.3 above), we saw that the additivity of probability measures precludes their representing epistemic states of complete ignorance. Here we shall see that the Bayesian dynamics of refute and rescale also precludes a different sort of neutrality of the probability distribution. Bayesian dynamics makes it impossible to select a prior probability distribution that does not exercise a controlling influence on the subsequent course of conditionalization.46 The essential problem is suggested by the notion of refute-and-rescale dynamics. The results of conditionalization must already be present in the prior probability; all that conditionalization does is to remove those parts of the prior probability

45 The informal considerations invoked here may appear more familiar if we vary the example slightly. Imagine that we are to choose between the models H: y = ax and L: y = ax + b for DATA′. Then we would allow that H is better warranted, since a single datum is needed to fix the parameter a in H, whereas two data are needed to fix the two parameters a and b in L.
46 The phrase "problem of the priors" seems to mean somewhat different things to different authors. For further discussion, see [Earman, 1992, pp. 57-59; Mellor, 2005, pp. 94-95].
distribution attached to outcomes refuted by the evidence, revealing the pre-existing distribution of belief over the unrefuted outcomes. This same point is evident if we recall the formula governing conditional probabilities

P(H | E) = P(H & E)/P(E)

This formula tells us that P(H | E), the degree of belief we should have in H given that we have learned E, is fixed by the two prior probabilities P(H & E) and P(E). That means that if we fully specify our prior probabilities over some outcome space, we have also delivered an exhaustive catalog of what our belief in each outcome would be, given that we have learned some other outcome. Far from being neutral, the prior probability distribution anticipates how our beliefs will evolve as we learn any sequence of admissible outcomes; and different prior probability distributions can yield courses that differ greatly. This consideration gives further grounds for abandoning such terms as "ignorance priors" or "informationless priors" in favor of "priors constructed by some formal rule," as reported in Section 4.4 above.

As with the first problem, objectivists have the most acute difficulty with this second problem of the priors. For they are committed to there being one appropriate prior probability distribution in any given epistemic situation. Yet none can supply the neutral starting point appropriate in the absence of all evidence. So the objectivist must either accept the unpalatable conclusion that the initial prior probability contains information without evidential warrant; or accept that Bayesian dynamics can only be used once we have learned enough by other means to make a properly grounded selection of the prior probability. Subjectivists at first seem to have a better response. For them, a prior probability is simply a statement of personal opinion with no pretensions of evidential warrant. As evidence accumulates and is incorporated into the probability distributions by conditionalization, limit theorems assure us of a merging of opinion onto a unique distribution that properly represents the bearing of evidence. (For discussion of these theorems, see [Earman, 1992, Ch. 6].) However this solution comes at some cost. At any finite stage of the process, the posterior probability distribution is an unknown mix of unwarranted opinion and warranted support. While the limit theorems may assure us that, in particular circumstances, the mix will eventually converge onto warranted support, at any definite stage we may be arbitrarily far from it.47 Imagine that we have collected some very large set of evidence E_large and we are interested in its bearing on some hypothesis H. It will always be possible for a determined mischief maker to identify prior probabilities P(H & E_large) and P(E_large) so that P(H | E_large) is as close to one as you like or as close to zero as you like. (For example, with P(E_large) = 0.01, setting P(H & E_large) = 0.0099 yields P(H | E_large) = 0.99, while setting it to 0.0001 yields 0.01.)

The two differ essentially. The incremental measures are functions of three arguments, H, E and B, that represent the incremental support accorded H by evidence E alone with respect to a tacit background B. PSC(H | E) is a function of two arguments, H and E&B, that measures

47 Here is the obligatory reporting of Keynes' quip: "In the long run we are all dead."
the total support accorded to H by evidence E conjoined with the background B, that is, by the total evidence E&B.
5.7 The Prior Probability of Universal Generalizations
Neither objectivist nor subjectivist fares well if a zero or unit prior probability has been assigned injudiciously to some outcome, for then Bayesian dynamics becomes completely dogmatic. Once outcomes have been assigned zero or unit probability, it follows from Bayes' theorem that these assignments are unrevisable by conditionalization.48 The complete disbelief of a zero prior probability and the complete belief of a unit probability will persist indefinitely. In concrete terms, this sort of dogmatism is a familiar if unwelcome phenomenon. Consider the dogmatic conspiracy theorist who discounts the benign explanation of a catastrophe in favor of an elaborate conspiracy committed by some secret agency. Once a zero prior probability has been assigned to the benign explanation, no failure to reveal the secret agency's intervention, or even its existence, can restore belief in the benign explanation. Rather, Bayes' theorem will provide mounting assurance that each failure is further evidence of the perfection of the secret agency's cover-up. While the best Bayesian advice would seem to be to exercise great caution before assigning a zero prior probability, there are suggestions, reviewed in more detail in [Earman, 1992, §4.2, 4.3], that hypotheses with the power of universal generalizations ought to be assigned a zero prior probability. These concerns arise for any hypothesis H that entails a countable infinity of consequences E_1, E_2, . . .49 Considering only the first n consequences, we have

P(H) = P(H | E_1 & ... & E_n) · P(E_n | E_1 & ... & E_{n−1}) · P(E_{n−1} | E_1 & ... & E_{n−2}) · ... · P(E_2 | E_1) · P(E_1)
Popper [1959, Appendix VII] urged that we should expect the instances E_1, E_2, . . . of a universal generalization H to be equiprobable and independent of one another. That is,

P(E_n | E_1 & ... & E_{n−1}) · P(E_{n−1} | E_1 & ... & E_{n−2}) · ... · P(E_2 | E_1) · P(E_1) = P(E_n) · P(E_{n−1}) · ... · P(E_2) · P(E_1) = P(E_1)^n

But since this must hold for arbitrarily large n, and since P(E_1) < 1, it follows that the prior P(H) = 0. The obvious weakness of Popper's proposal is that the very fact that E_1, E_2, . . . are instances of a universal generalization H suggests that they are not independent. However, just a little dependence between E_1, E_2, . . . is not enough to allow a non-zero prior for H. Inspection of the above expression for P(H) shows that P(H) > 0 entails

48 Once P(H) = 0, Bayes' theorem (B) requires P(H | E) = 0 for all admissible E. (Conditionalizing on evidence E for which P(E) = 0 leads to an undefined posterior P(H | E).) It follows that for any H′ = ∼H for which P(H′) = 1, P(H′ | E) = 1 as well.
49 The simplest example is a universal generalization that asserts "For all x, Q(x)," where x ranges over a countably infinite set of individuals a, b, c, . . . Its consequences are Q(a), Q(b), . . .
Lim_{n→∞} P(E_n | E_1 & ... & E_{n−1}) = 1

(An infinite product of factors, each no greater than one, can remain bounded away from zero only if its factors converge to one.) That is, finitely many favorable instances of H eventually make us arbitrarily sure that the next instance will be favorable. That was an outcome that Jeffrey [1983, p. 194] felt sufficiently akin to "jumping to conclusions" to want to renounce non-zero prior probabilities on universal hypotheses. If this commitment to the projectability of the hypothesis does not yet seem immodest, the analysis needs only a slight modification to make it more striking. Group the instances E_1, E_2, . . . into sets that grow very rapidly in size. For example

F_1 = E_1
F_2 = E_2 & ... & E_10
F_3 = E_11 & ... & E_100
. . .

so that F_1 & ... & F_n = E_1 & ... & E_{10^{n−1}}. Each of the F_i is a consequence of H, so the above analysis can be repeated with F_i substituted for E_i to infer that P(H) > 0 entails

Lim_{n→∞} P(E_1 & ... & E_{10^{n+1}} | E_1 & ... & E_{10^n}) = 1

That is, if we assign non-zero prior probability to a hypothesis with the power of a universal generalization, eventually we are willing to project from some finite set of its positive instances to arbitrarily many more at arbitrarily high probability. The point is not that there is something intrinsically wrong with such immodesty. Indeed, for the right hypothesis, this may be just the appropriate epistemic attitude. The point is that assigning a zero or a non-zero prior probability to a hypothesis with the power of a universal generalization is never neutral. The former commits us dogmatically never to learning the hypothesis; the latter, no matter how small the non-zero prior, commits us to its arbitrary projectability on finite favorable evidence. There is no intermediate value to assign that leaves our options open. Our inductive course is set once the prior is assigned. Finally, one sees that some significant prior knowledge is needed if we are to assign the non-zero prior probabilities prudently. For a countably infinite set of outcomes E_1, E_2, . . . corresponds to an uncountable infinity of universal hypotheses: each hypothesis corresponds to one of the uncountably many combinations of the E_i and their negations ∼E_i. However we can assign non-zero prior probability to at most countably many of these hypotheses. That is, loosely speaking, when we assign our priors, we are selecting in advance the measure-zero subset of hypotheses that we will be prepared to learn, and dogmatically banishing the rest to disbelief no matter what evidence may accrue.
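The projectability forced by a non-zero prior can be exhibited in a toy computation. In the sketch below, which is an illustration rather than anything in the text, H says that every instance is favorable and receives prior p; the only rival considered makes the instances independent with probability q each, so that P(E_1 & ... & E_n) = p + (1 − p)q^n.

```python
def predictive(p, q, n):
    # P(E_{n+1} | E_1 & ... & E_n) under the two-hypothesis model:
    # H ("all instances favorable") with prior p, versus an independence
    # alternative on which each instance holds with probability q
    return (p + (1 - p) * q ** (n + 1)) / (p + (1 - p) * q ** n)

p, q = 1e-6, 0.5
for n in (0, 10, 30, 60):
    print(n, predictive(p, q, n))
# even for a prior of one in a million, the predictive probability
# climbs towards 1 once q**n becomes small compared with p
```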
5.8 Uncertain Evidence
The versions of Bayesian confirmation theory considered so far depict learning as an absolute: if we are given evidence E, it cannot be doubted. That is quite unrealistic. No datum is ever certain and it is not uncommon in science that the accumulation of later evidence leads one to doubt the correctness of evidence collected
earlier. Something like this could be incorporated into traditional Bayesianism if we separate, say, the inerrant sense datum D: "my retina senses a plesiosaur-like shape on the surface of Loch Ness" from the fallible evidence E: "I saw a plesiosaur on Loch Ness." That fallibility could then enter the analysis through the conditional probability P(E | D). This solution complicates the analysis without ultimately allowing that all evidence is fallible, for it merely pushes the inerrancy deeper into an elaborated outcome space.

A direct solution has been proposed by Jeffrey [1983, Ch. 11]. Standard Bayesianism introduces evidence E by conditionalizing, which forces unit probability onto E. Jeffrey instead proposed that we merely shift a lesser amount of probability to E, commensurate with our confidence in it. This non-Bayesian shift from an initial probability P_i to a final probability P_f is then propagated through the outcome space by Jeffrey's rule. For the simplest case of evidence that partitions the outcome space into {E, ∼E}, the rule asserts that, for each hypothesis H,

P_f(H) = P_i(H | E) · P_f(E) + P_i(H | ∼E) · P_f(∼E)

The distinctive property of this rule is that it leaves unaffected our judgments of what the probability of H would be were E definitely true or definitely false. That is, for all H, it satisfies P_i(H | E) = P_f(H | E) and P_i(H | ∼E) = P_f(H | ∼E). Jeffrey's rule is unique in that it proves to be the only rule satisfying these two conditions [Jeffrey, 1983, p. 169; Howson and Urbach, 2006, p. 85]. Jeffrey's rule reduces to ordinary conditionalization if the evidence E is certain, that is, if P_f(E) = 1.

One unappealing aspect of Jeffrey's rule is that the order in which it is applied in successive applications matters. That is, if we apply it first on a partition {E, ∼E} and then on {F, ∼F}, in general we do not arrive at the same final probability distribution as when the order is reversed. (See [Diaconis and Zabell, 1982].) In the context of the Shafer-Dempster theory (Section 4.2 above), Dempster's rule of combination allows uncertain evidence to be combined with other belief distributions in an analogous way, but one that is independent of the order of combining. With appropriate restrictions, Dempster's rule of combination can yield ordinary Bayesian conditionalization as a special case. See [Diaconis and Zabell, 1986] for discussion of the relationship between Jeffrey's rule and Dempster's rule.
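A minimal sketch of Jeffrey's rule for the two-cell partition, with made-up numbers; the function and variable names are ours, not Jeffrey's.

```python
def jeffrey_update(p_h_given_e, p_h_given_not_e, pf_e):
    # P_f(H) = P_i(H|E) * P_f(E) + P_i(H|~E) * P_f(~E)
    return p_h_given_e * pf_e + p_h_given_not_e * (1 - pf_e)

# illustrative numbers: H = "plesiosaur present", E = "a plesiosaur was seen"
p_h_given_e, p_h_given_not_e = 0.8, 0.1

print(jeffrey_update(p_h_given_e, p_h_given_not_e, 0.7))  # uncertain evidence
print(jeffrey_update(p_h_given_e, p_h_given_not_e, 1.0))  # certain evidence:
# with P_f(E) = 1 the rule reduces to ordinary conditionalization, P_i(H|E)
```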
6 FURTHER CHALLENGES
This concluding section collects some of the many further challenges to Bayesian confirmation theory that can be found in the literature. In a letter to Nature and a subsequent article, Popper and Miller [1983; 1987] urged that the probabilistic support evidence E provides a hypothesis H in Bayesian confirmation theory is not inductive support. Their argument depended upon decomposing H into conjunctive parts by the logical identity
H = (H ← E) & (H ∨ E)

where the first conjunct, "H, if E," is (H ← E) = (H ∨ ∼E). Since the second conjunct (H ∨ E) is deductively entailed by E, they identify the first, (H ← E), as "containing all of [H] that goes beyond [E]" [1983, p. 687] and require that E supplies inductive support for H only in so far as it supports this part that is not deductively entailed by E. However it is easy to show that P(H ← E | E) ≤ P(H ← E), so that (H ← E) never accrues probabilistic support from E. (Indeed, P(H ← E | E) = P(H & E)/P(E) = P(H | E), while P(H ← E) = P(H & E) + P(∼E); the inequality then follows since P(H & E) ≤ P(E) ≤ 1.) While Popper and Miller's argument drew much discussion (for a survey see [Earman, 1992, pp. 95-98]), the decisive flaw was identified immediately by Jeffrey [1984], who objected that (H ← E) is not the part of H that goes beyond E. The argument rests on a confusion of "part" with "deductive consequence." While we might think of theorems as logical parts of the axioms that entail them, that sense of part no longer works when inductive relations are the concern. For example, let H be that my lost keys are in my office ("office") or in my bedroom ("bedroom"), and let E be the evidence that they are in my house. Then (H ← E) is the enormous disjunction

office ∨ bedroom ∨ Mars ∨ Jupiter ∨ Saturn ∨ . . .

where the ellipses include all places not in my house. It is as peculiar to call this disjunction a part of the original hypothesis as it is unsurprising that it is not probabilistically confirmed by the evidence.

Another challenge to Bayesian confirmation theory comes from what first appears an unlikely source. Over recent decades, there has been a mounting body of evidence that people do not assess uncertainty in accord with the strictures of probability theory. The most famous example comes from Kahneman and Tversky's experiments [Tversky and Kahneman, 1982], which showed that people are quite ready to contradict the monotonicity of a probability measure: the property that an outcome is never more probable than any of its logical consequences. In the case of "Linda," they found that people would affirm that a suitably described character Linda is more likely to be a bank teller and feminist than a consequence of this possibility, that she is a bank teller. One response to this body of work is to lament the unreliability of people's intuitions about uncertainty and the need for them to receive training in the probability calculus. Another response notes that we humans have been dealing more or less successfully with uncertainty for a long time; this success extends to much of science, where uncertainties are rarely treated by systematic Bayesian methods. So perhaps we should seek another approach modelled a little more closely on what people naturally do. Cases like "Linda" suggest that we naturally try to select the one conclusion that best balances the risk of error against informativeness, and detach it from the evidence. That is quite compatible with preferring to detach less probable outcomes. When told a coin was tossed, most people would infer to the outcome "heads or tails" as opposed to "heads or tails or on edge," even though the latter is more probable and indeed a certainty.
In philosophy of science, if a numerical calculus is used to assess the bearing of evidence, it is most often the probability calculus. In other domains, that is less so. In the artificial intelligence literature, alternative calculi proliferate. They include the Shafer-Dempster theory, possibility theory [Zadeh, 1978], as well as numerical systems generated for particular contexts, such as the MYCIN system applied in medical diagnostics [Shortliffe and Buchanan, 1985]. One motivating factor has been pragmatic: a probabilistic analysis requires a prohibitively large number of quantities to be fixed in the general case. In an outcome space with n atomic propositions, there are of the order of n! conditional probabilities. Seen from that perspective, possibility theory's simple rule for disjunction is appealing: the possibility of A ∨ B is just the larger of the possibilities of A and B individually. That same rule would seem disastrous to the systematic philosopher of science schooled in Bayesian intuitions, who will wonder why the rule does not distinguish the cases of A and B coincident, mutually exclusive, and all those in between. The debate between proponents of the different calculi, including probabilistic systems, has been fierce. The principal goal seems to be to establish that each proponent's system has the same or greater expressive power than another's. (See for example [Cheeseman, 1986; Zadeh, 1986].)

The issue of "stopping rules" has entered the philosophy of science literature from the debate between Bayesian and Neyman-Pearson statisticians. It pertains to a deceptively simple plan for assuring belief in a false hypothesis. Assume we have a fair coin. On average, on repeated tosses, it will show heads with a frequency of about 0.5. The actual frequency will fluctuate from this expected 0.5, and there will always be some definite but small probability that the observed frequency is arbitrarily far removed from the expected. In order to advance the misperception that the coin is not fair, we toss it repeatedly, waiting mischievously for chance to realize one of these rare removals, possible but very unlikely for a fair coin. That is our "stopping rule." When the rare removal happens, we report the resulting count of heads and tails as evidence that the coin is not fair. The plan can be generalized to seek disconfirmation of any statistical hypothesis. We accumulate observations, waiting for chance fluctuations to deliver results that can only arise with low probability if the hypothesis is true. The result is troubling for Bayesians. For them, the import of evidence is fully determined by the likelihood ratio, and that likelihood ratio has been contrived to favor arbitrarily strongly the falsity of the hypothesis. That the experimenter intended to generate a deceptive result is irrelevant; the intentions of the experimenter collecting the data do not appear in Bayes' theorem. Neyman-Pearson statisticians have a ready escape that is not available to the Bayesian: the strategy of "try and try again until you get a statistically significant result" is not a good test of the hypothesis. Correspondingly, a procedure that uses this stopping rule does not count as a severe test of the hypothesis in the sense of Mayo [1996, §11.3, 11.4].

The Bayesian response has been to reaffirm that the likelihoods are all that is relevant to discerning the bearing of evidence, and also to point out that the
stopping rule may never be satisfied, so that the planned disconfirmation can fail. That is, assume that a Bayesian agent assigns a prior probability p to some hypothesis and decides to accumulate evidence concerning the hypothesis until the posterior probability has risen to q > p. Kadane, Schervish and Seidenfeld [1996] show that, if the hypothesis is false, the agent's probability that the stopping rule is eventually satisfied obeys

Probability of stopping ≤ p(1 − q)/(q(1 − p))
Therefore, the closer the desired posterior probability q is set to one, the closer the probability of stopping comes to zero; that is, the probability as assessed by the agent approaches one that the experiment will continue indefinitely without returning the result sought in the stopping rule.50 For more discussion see [Savage et al., 1962, pp. 17-20 and subsequent discussion; Mayo, 1996, §11.3, 11.4; Kadane et al., 1996].
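The bound can be probed by simulation. The sketch below is an illustration with assumed numbers, not a computation from Kadane, Schervish and Seidenfeld: the hypothesis under test says the coin is biased towards heads, the coin actually tossed is fair (so the hypothesis is false), and the agent stops when the posterior reaches q.

```python
import random

def stops(p, q, bias=0.7, max_flips=10_000):
    # true coin is fair, so the "biased" hypothesis under test is false;
    # update the posterior odds by the likelihood ratio after each flip
    odds = p / (1 - p)
    for _ in range(max_flips):
        heads = random.random() < 0.5
        odds *= (bias if heads else 1 - bias) / 0.5
        if odds / (1 + odds) >= q:   # posterior has reached q: stop
            return True
    return False

p, q = 0.1, 0.5
random.seed(1)
trials = 2_000
frequency = sum(stops(p, q) for _ in range(trials)) / trials
bound = p * (1 - q) / (q * (1 - p))
print(frequency, bound)  # the observed stopping frequency typically
                         # falls below the bound, here 1/9
```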
7 CONCLUSION
This chapter has reviewed many challenges to Bayesian confirmation theory. Some are global, reflecting differences in the very conception of inductive inference and the inductive enterprise in science. Others are more narrowly targeted at specific commitments of Bayesian confirmation theory. That there are so many challenges is a reflection of the dominance that Bayesian confirmation theory now enjoys in philosophy of science. It offers a well-articulated account of inductive inference in science that is of unrivalled precision, scope and power. In this regard it greatly outshines its competitors. Abductive accounts (inference to the best explanation) are still stalled by the demand for a precise account of their central notion, explanation, whereas Bayesian confirmation theory suffers from the reverse: excessive elucidation of its central notion, probability. This precision and depth of articulation is part of what enables the challenges to Bayesian confirmation theory to be mounted. For it is only after a position is clearly articulated that its weaknesses may also be discerned. When a new Goliath enters the arena, many would-be Davids come with their pebbles.
ACKNOWLEDGEMENTS

I am grateful to Malcolm Forster for helpful discussion on an earlier draft.

50 If a tenfold boost in probability is sought, so that q/p = 10, we must have p ≤ 0.1 to start with, and the formula assures us that the probability of stopping is less than (p/q − p)/(1 − p) = (0.1 − p)/(1 − p) < 0.1.
BIBLIOGRAPHY

[Achinstein, 2001] P. Achinstein. The Book of Evidence. New York: Oxford University Press, 2001.
[Aczel, 1966] J. Aczel. Lectures on Functional Equations and their Applications. New York: Academic Press, 1966.
[Arntzenius, manuscript] F. Arntzenius. Goodman's and Dorr's Riddles of Induction.
[Bacchus et al., 1990] F. Bacchus, H. E. Kyburg, Jr., and M. Thalos. Against Conditionalization, Synthese, 85, pp. 475-506, 1990.
[Bartha, 2004] P. Bartha. Countable Additivity and the de Finetti Lottery, British Journal for the Philosophy of Science, 55, pp. 301-321, 2004.
[Bartha and Johns, 2001] P. Bartha and R. Johns. Probability and Symmetry, Philosophy of Science, 68 (Proceedings), pp. S109-S122, 2001.
[Blake et al., 1960] R. M. Blake, C. J. Ducasse, and E. H. Madden. Theories of Scientific Method: The Renaissance Through the Nineteenth Century. E. H. Madden, ed. Seattle: University of Washington Press, 1960.
[Brown, 1994] H. Brown. Reason, Judgment and Bayes's Law, Philosophy of Science, 61, pp. 351-69, 1994.
[Browne, 2000] M. W. Browne. Cross-Validation Methods, Journal of Mathematical Psychology, 44, pp. 108-132, 2000.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. Chicago: University of Chicago Press, 1950.
[Cheeseman, 1986] P. Cheeseman. Probabilistic versus Fuzzy Reasoning, pp. 85-102 in L. N. Kanal and J. F. Lemmer, eds., Uncertainty in Artificial Intelligence: Machine Intelligence and Pattern Recognition, Volume 4. Amsterdam: North Holland, 1986.
[Cox, 1961] R. T. Cox. The Algebra of Probable Inference. Baltimore: Johns Hopkins Press, 1961.
[Diaconis and Zabell, 1982] P. Diaconis and S. Zabell. Updating Subjective Probability, Journal of the American Statistical Association, 77, pp. 822-30, 1982.
[Diaconis and Zabell, 1986] P. Diaconis and S. Zabell. Some Alternatives to Bayes's Rule, pp. 25-38 in B. Grofman and G. Owen, eds., Information and Group Decision Making: Proceedings of the Second University of California, Irvine, Conference on Political Economy. Greenwich, Conn.: JAI Press, 1986.
[Duhem, 1969] P. Duhem. To Save the Phenomena: An Essay on the Idea of Physical Theory from Plato to Galileo. Chicago: University of Chicago Press, 1969.
[Earman, 1992] J. Earman. Bayes or Bust. Cambridge, MA: Bradford-MIT, 1992.
[Earman and Salmon, 1992] J. Earman and W. Salmon. The Confirmation of Scientific Hypotheses, ch. 2 in M. H. Salmon et al., Introduction to the Philosophy of Science. Prentice-Hall, 1992; repr. Indianapolis: Hackett, 1999.
[Eells and Fitelson, 2002] E. Eells and B. Fitelson. Symmetries and Asymmetries in Evidential Support, Philosophical Studies, 107, pp. 129-142, 2002.
[Edwards, 1972] A. W. F. Edwards. Likelihood. London: Cambridge University Press, 1972.
[Ellsberg, 1961] D. Ellsberg. Risk, Ambiguity, and the Savage Axioms, Quarterly Journal of Economics, 75, pp. 643-69, 1961.
[Feinstein, 1977] A. R. Feinstein. Clinical Biostatistics XXXIX. The Haze of Bayes, the Aerial Palaces of Decision Analysis, and the Computerized Ouija Board, Clinical Pharmacology and Therapeutics, 21, no. 4, pp. 482-496, 1977.
[de Finetti, 1937] B. de Finetti. Foresight: Its Logical Laws, Its Subjective Sources, trans. H. E. Kyburg, pp. 93-158 in H. E. Kyburg and H. Smokler, eds., Studies in Subjective Probability. New York: John Wiley & Sons, 1964.
[Fine, 1973] T. L. Fine. Theories of Probability. New York: Academic Press, 1973.
[Fishburn, 1986] P. C. Fishburn. The Axioms of Subjective Probability, Statistical Science, 1, pp. 335-58, 1986.
[Forster, 2002] M. Forster. Predictive Accuracy as an Achievable Goal of Science, Philosophy of Science, 69, pp. S124-34, 2002.
[Forster, 2006] M. Forster. Counterexamples to a Likelihood Theory of Evidence, Minds and Machines, 16, pp. 319-338, 2006.
[Forster, 2007] M. Forster. A Philosopher's Guide to Empirical Success, Philosophy of Science, 74, pp. 588-600, 2007.
[Forster and Sober, 1994] M. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions, British Journal for the Philosophy of Science, 45, pp. 1-35, 1994.
[Forster and Sober, 2004] M. Forster and E. Sober. Why Likelihood? pp. 153-190 in M. L. Taper and S. R. Lele, eds., The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations. Chicago and London: University of Chicago Press, 2004.
[Galavotti, 2005] M. C. Galavotti. Philosophical Introduction to Probability. Stanford: CSLI Publications, 2005.
[Gillies, 2000] D. Gillies. Philosophical Theories of Probability. London: Routledge, 2000.
[Glymour, 1980] C. Glymour. Theory and Evidence. Princeton: Princeton University Press, 1980.
[Hájek, 2003] A. Hájek. What Conditional Probability Could Not Be, Synthese, 137, pp. 273-323, 2003.
[Hájek, 2008] A. Hájek. Arguments for — or against — Probabilism? British Journal for the Philosophy of Science, 59, pp. 793-819, 2008.
[Hempel, 1945] C. G. Hempel. Studies in the Logic of Confirmation, Mind, 54, pp. 1-26, 97-121, 1945; reprinted with changes, comments and Postscript (1964) in C. G. Hempel, Aspects of Scientific Explanation. New York: Free Press, 1965, Ch. 1.
[Horwich, 1982] P. Horwich. Probability and Evidence. Cambridge: Cambridge University Press, 1982.
[Howson, 1997] C. Howson. A Logic of Induction, Philosophy of Science, 64, pp. 268-90, 1997.
[Howson and Urbach, 2006] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. 3rd ed. La Salle, Illinois: Open Court, 2006.
[Humphreys, 1985] P. Humphreys. Why Propensities Cannot be Probabilities, Philosophical Review, 94, pp. 557-570, 1985.
[Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge: Cambridge University Press, 2003.
[Jeffrey, 1983] R. Jeffrey. The Logic of Decision. 2nd ed. Chicago: University of Chicago Press, 1983.
[Jeffreys, 1961] H. Jeffreys. Theory of Probability. 3rd ed. Oxford: Clarendon Press, 1961.
[Joyce, 1998] J. Joyce. A Nonpragmatic Vindication of Probabilism, Philosophy of Science, 65, pp. 575-603, 1998.
[Kadane and O'Hagan, 1995] J. B. Kadane and A. O'Hagan. Using Finitely Additive Probability: Uniform Distributions on Natural Numbers, Journal of the American Statistical Association, 90, pp. 626-31, 1995.
[Kadane et al., 1996] J. B. Kadane, M. J. Schervish, and T. Seidenfeld. Reasoning to a Foregone Conclusion, Journal of the American Statistical Association, 91, pp. 1228-36, 1996.
[Kass and Wasserman, 1996] R. E. Kass and L. Wasserman. The Selection of Prior Distributions by Formal Rules, Journal of the American Statistical Association, 91, pp. 1343-70, 1996.
[Kaplan, 1998] M. Kaplan. Decision Theory as Philosophy. Cambridge: Cambridge University Press, 1998.
[Kelly, 1996] K. T. Kelly. The Logic of Reliable Inquiry. New York: Oxford University Press, 1996.
[Kelly and Glymour, 2004] K. T. Kelly and C. Glymour. Why Probability Does not Capture the Logic of Justification, in C. Hitchcock, ed., Contemporary Debates in Philosophy of Science. Malden, MA: Blackwell, 2004.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability. London: Macmillan, 1921; repr. New York: AMS Press, 1979.
[Kyburg, 1959] H. E. Kyburg. Probability and Randomness I and II, Journal of Symbolic Logic, 24, pp. 316-18, 1959.
[Kyburg, 1987] H. E. Kyburg. Bayesian and non-Bayesian Evidential Updating, Artificial Intelligence, 31, pp. 271-93, 1987.
[Kyburg and Teng, 2001] H. E. Kyburg and C. M. Teng. Uncertain Inference. New York: Cambridge University Press, 2001.
[Laraudogoitia, 2004] J. P. Laraudogoitia. Supertasks, The Stanford Encyclopedia of Philosophy (Winter 2004 Edition), Edward N. Zalta, ed., 2004. http://plato.stanford.edu/archives/win2004/entries/spacetime-supertasks/
[Laudan, 1997] L. Laudan. How About Bust? Factoring Explanatory Power Back into Theory Evaluation, Philosophy of Science, 64, pp. 306-16, 1997.
[Levi, 1974] I. Levi. On Indeterminate Probabilities, Journal of Philosophy, 71, pp. 391-418, 1974.
[Levi, 1980] I. Levi. The Enterprise of Knowledge. Cambridge, MA: MIT Press, 1980.
[Lewis, 1980] D. Lewis. A Subjectivist's Guide to Objective Chance, pp. 263-93 in R. C. Jeffrey, ed., Studies in Inductive Logic and Probability. Berkeley: University of California Press, 1980.
[Mayo, 1996] D. Mayo. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press, 1996.
[Mayo, manuscript] D. Mayo. Three Battle Waves in the Philosophy of Statistics.
[McGee, 1994] V. McGee. Learning the Impossible, pp. 179-199 in E. Eells and B. Skyrms, eds., Probability and Conditionals: Belief Revision and Rational Decision. Cambridge: Cambridge University Press, 1994.
[Mellor, 2005] D. H. Mellor. Probability: A Philosophical Introduction. Abingdon: Routledge, 2005.
[Mill, 1891] J. S. Mill. A System of Logic: Ratiocinative and Inductive. 8th ed., 1891; repr. Honolulu: University Press of the Pacific, 2002.
[Norton, 2003] J. D. Norton. A Material Theory of Induction, Philosophy of Science, 70, pp. 647-70, 2003.
[Norton, 2005] J. D. Norton. A Little Survey of Induction, pp. 9-34 in P. Achinstein, ed., Scientific Evidence: Philosophical Theories and Applications. Johns Hopkins University Press, 2005.
[Norton, 2007] J. D. Norton. Probability Disassembled, British Journal for the Philosophy of Science, 58, pp. 141-171, 2007.
[Norton, 2007a] J. D. Norton. Disbelief as the Dual of Belief, International Studies in the Philosophy of Science, 21, pp. 231-252, 2007.
[Norton, 2008] J. D. Norton. Ignorance and Indifference, Philosophy of Science, 75, pp. 45-68, 2008.
[Norton, 2008a] J. D. Norton. The Dome: An Unexpectedly Simple Failure of Determinism, Philosophy of Science, 75, pp. 786-98, 2008.
[Norton, forthcoming] J. D. Norton. There are No Universal Rules for Induction, Philosophy of Science, forthcoming.
[Norton, forthcoming a] J. D. Norton. Cosmic Confusions: Not Supporting versus Supporting Not-, Philosophy of Science, forthcoming.
[Norton, forthcoming b] J. D. Norton. Deductively Definable Logics of Induction, Journal of Philosophical Logic, forthcoming.
[Popper, 1959] K. R. Popper. The Logic of Scientific Discovery. London: Hutchinson, 1959.
[Popper and Miller, 1983] K. R. Popper and D. Miller. A Proof of the Impossibility of Inductive Probability, Nature, 302, pp. 687-88, 1983.
[Popper and Miller, 1987] K. R. Popper and D. Miller. Why Probabilistic Support is not Inductive, Philosophical Transactions of the Royal Society of London, Series A, Mathematical and Physical Sciences, 321, pp. 569-91, 1987.
[Royall, 1997] R. M. Royall. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall, 1997.
[Salmon, 1966] W. Salmon. The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press, 1966.
[Savage, 1972] L. J. Savage. The Foundations of Statistics. 2nd revised ed. New York: Dover, 1972.
[Savage et al., 1962] L. J. Savage et al. The Foundations of Statistical Inference. London: Methuen, 1962.
[Schick, 1986] F. Schick. Dutch Bookies and Money Pumps, Journal of Philosophy, 83, pp. 112-119, 1986.
[Seidenfeld, 1979] T. Seidenfeld. Why I am not an Objective Bayesian: Some Reflections Prompted by Rosenkrantz, Theory and Decision, 11, pp. 413-440, 1979.
[Shortliffe and Buchanan, 1985] E. H. Shortliffe and B. G. Buchanan. A Model of Inexact Reasoning in Medicine, Ch. 11 in B. G. Buchanan and E. H. Shortliffe, eds., Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley, 1985.
[Skyrms, 1975] B. Skyrms. Choice and Chance: An Introduction to Inductive Logic. 2nd ed. Encino, CA: Dickenson, 1975.
[Tversky and Kahneman, 1982] A. Tversky and D. Kahneman. Judgments of and by Representativeness, pp. 84-98 in D. Kahneman, P. Slovic and A. Tversky, eds., Judgment under Uncertainty: Heuristics and Biases. Cambridge, UK: Cambridge University Press, 1982.
[van Fraassen, 1990] B. van Fraassen. Laws and Symmetry. Oxford: Clarendon, 1990.
[van Inwagen, 1996] P. van Inwagen. Why is There Anything at All? Proceedings of the Aristotelian Society, Supplementary Volumes, 70, pp. 95-120, 1996.
[Walley, 1991] P. Walley. Statistical Reasoning with Imprecise Probabilities. London: Chapman and Hall, 1991.
[Wasserman, 2000] L. Wasserman. Bayesian Model Selection and Model Averaging, Journal of Mathematical Psychology, 44, pp. 92-107, 2000.
[Williamson, 1999] J. Williamson. Countable Additivity and Subjective Probability, British Journal for the Philosophy of Science, 50, pp. 401-16, 1999.
[Zadeh, 1978] L. Zadeh. Fuzzy Sets as the Basis for a Theory of Possibility, Fuzzy Sets and Systems, 1, pp. 3-28, 1978.
[Zadeh, 1986] L. Zadeh. Is Probability Theory Sufficient for Dealing with Uncertainty in AI: A Negative View, pp. 103-116 in L. N. Kanal and J. F. Lemmer, eds., Uncertainty in Artificial Intelligence: Machine Intelligence and Pattern Recognition, Volume 4. Amsterdam: North Holland, 1986.
BAYESIANISM AS A PURE LOGIC OF INFERENCE

Colin Howson

The calculus of probability can say absolutely nothing about reality ... As with the logic of certainty, the logic of the probable adds nothing of its own: it merely helps one to see the implications contained in what has gone before. (Bruno de Finetti, Theory of Probability, vol. 1, p. 215)
1 INTRODUCTION
The seventeenth century saw the beginning of that great scientific revolution from which mathematics emerged as the language of a unified physics. This was the continuous mathematics of the differential and integral calculus and (eventually) the very rich theory of complex numbers and analytic functions. But the late seventeenth century also saw another seminal scientific development: a connection forged between the idea of a graded probability and another new branch of mathematics, the discrete mathematics of combinations and permutations. Starting from mundane beginnings,1 it developed into a completely novel science of mathematical probability and statistics. What was not realised for a long time after the first seminal treatises of Huygens, Montmort and James Bernoulli was that two quite distinct notions seemed to be subsumed under the common title 'probability', notions which we now variously distinguish by the names of 'epistemic probability' and 'physical probability', 'Bayesian probability' and 'chance', and, by Carnap, 'probability1' and 'probability2'.2 The focus of this discussion will be on the former, and in particular on the question of whether its laws should be classified as laws of logic.

The idea that there might be an intimate relationship between logic and probability, at any rate epistemic probability, has been the subject of exploration and controversy for over three centuries. Both disciplines specify rules of valid non-domain-specific reasoning, and it would seem a reasonable question why one should be distinguished as logic and the other not. The first great post-Renaissance treatise on

1 'A problem in games of chance, proposed to an austere Jansenist [Pascal] by a man of the world [the Chevalier de Méré, an assiduous gambler], was the origin of the calculus of probabilities.' ([Poisson, 1837, p. 1]; my translation)
2 Poisson, who was the first to explicitly note the duality, used the words 'probability' and 'chance' [1837, p. 80].
logic, the so-called Port Royal Logic (La logique, ou l'art de penser), subsumed both the logic of certainty (deductive logic) and that of uncertainty (probability) under the common heading of logic, and the idea that they are complementary subdisciplines runs like a thread, sometimes visible, sometimes not, through the subsequent history. Be that as it may, most contemporary Bayesians see their discipline less as a part of logic, which in common with most contemporary deductive logicians they regard as comprising just deductive logic, than as part of a general theory of rational belief and decision. Savage, for example, tells us in his classic Bayesian text [1954] that he is about to develop a theory of 'the behaviour of a "rational" person with respect to decisions' [1954, p. 7]. In locating Bayesian probability within the theoretical milieu of utility and rational decision he was, of course, following one of the two great pioneers of modern Bayesianism, Frank Ramsey, who was the first to develop the theory of probability within an axiomatic theory of preference. The other, Bruno de Finetti, is best known for pointing out that if you use your probability-evaluations as the basis for your assessment of fair odds (i.e. odds at which you reckon neither side of the bet has an advantage: this relation between probability-evaluations and fair odds goes back to the beginnings of the theory of probability), then the constraints imposed by the finitely additive probability calculus protect you from making bets which you could be made to lose in any event, and more generally from making inadmissible decisions (an admissible decision is one which cannot be dominated, i.e. one for which there is no alternative decision that will produce a better outcome, in terms of gain or loss, regardless of what the true state of affairs is). Probability-evaluations possessing these features de Finetti termed coherent.3 But, paradoxically, it is in his work that another, apparently not at all decision-theoretically oriented, view of coherence emerges, as a species of intrinsic consistency which according to de Finetti the evaluations possess, or not, independently of who makes them or why:

[I]t is better to speak of coherence (consistency) of probability evaluations rather than of individuals . . . because the notion belongs strictly to the evaluations and only indirectly to the individuals. [1937, p. 103, footnote (b)]

The parenthetical gloss of 'consistency' for 'coherence' is de Finetti's own, which he reinforced with the observation that incoherent probability assignments actually 'contain an intrinsic contradiction' (ibid.). This is not the only reference to a specifically logical character of coherent evaluations in that paper. Its title, in the original French, is 'La prévision: ses lois logiques, ses sources subjectives' ('Foresight: its logical laws, its subjective sources' in the Kyburg translation4)

3 There are several qualifications that need to be added to this brief summary of de Finetti's theory of coherence. I shall do so later.
4 'Foresight' is not the best of translations of what was for de Finetti a technical term meaning roughly 'expectation'.
and 'logical laws' was by no means an idle phrase: in another paper published a year earlier, de Finetti wrote that

it is beyond doubt that probability theory can be considered as a multivalued logic ... and that this point of view is the most suitable to clarify the foundational aspects of the notion and the logic of probability [1936, p. 183]; quoted in [Coletti and Scozzafava, 2002, p. 61].

Yet despite such advocacy from probably the most influential of all modern Bayesians, a view of the rules of probability as furnishing an authentic logic, though with values in [0,1] rather than {0,1} (deductive logic), not only did not go on to command widespread acceptance, but even de Finetti himself seemed in his later work to have relinquished it in favour of the view now almost universally associated with his work: that the rules are merely prudential safeguards, protecting the agent from 'decisions whose consequences are manifestly undesirable (leading to certain loss)' [1974, vol. 1, p. 72].

But there is more to the general failure to see in Bayesian probability an authentic logic than de Finetti's own personal evolution from apparent logicist to determined decision-theorist. There are also some important and seemingly recalcitrant facts: modern deductive logic deals in discrete (two) truth-values, and its central notions of consistency and consequence, as properties of, and relations between, sentences, seem to have no analogues in the Bayesian formalism. One can certainly say that a set of probability-evaluations is consistent if it obeys the probability axioms, but this is to give the word 'consistent' an altogether different meaning, or so it seems, than the one it has in deductive logic. It certainly seemed so to Henry Kyburg, who in his translator's preface to the English version of de Finetti's 1937 paper tells us that it was for this reason that he translated de Finetti's original 'cohérence' as 'coherence':

"Consistency" is used by some English and American authors, and is perfectly acceptable to de Finetti, but it is ambiguous (from the logician's point of view) because, as applied to beliefs, it has another very precise and explicit meaning in formal logic. As the words are used in this translation, to say that a body of beliefs is "consistent" is to say (as in logic) that it contains no two beliefs that are contradictory. To say that in addition the body of beliefs is "coherent" is to say that the degrees of belief satisfy certain further conditions. ([1964, p. 95]; parenthesis in original)5

I shall show later that the apparently very strong disanalogy between deductive consistency and probabilistic 'coherence' that Kyburg thought he had discerned vanishes on closer examination. This is remarkable enough, but even more remarkable, as we shall see, is the very close formal kinship that emerges between the two

5 Kyburg should strictly have used the word 'implies' rather than 'contains'. As David Makinson pointed out, it is not necessary for inconsistency that a body of beliefs contain two contradictory beliefs.
notions, which at a certain level of description are actually identical. This in my opinion fully justifies the use of the one word 'consistency' to apply to both. We shall also see that although de Finetti did not describe his work in the language of modern logic, the concepts he forged and the results he proved are fundamentally logical in character.

This is not to say that in the preceding three centuries people did not work hard to elicit some fruitful relationship between logic and probability. They did, but despite their efforts very little of a positive nature emerged, and when it did it was the result of taking a quite different approach. The two fundamental notions of modern deductive logic are (semantic) consistency and (semantic) logical consequence, which in classical logic at least are interdefinable. Practically from the start it had been agreed that in some sense probability generalised deductive logic ('the logic of certainty', as it was often described), but it was also agreed that it was the deductive consequence relation that was generalised. For reasons which will become apparent in the next few sections this was the wrong choice, but it is instructive to see why it does not work, particularly since there is still a large number of people who think it does.
2 BOLZANO AND PARTIAL ENTAILMENT
The first significant marker in the probability-as-generalised-deduction programme was put down in the mid-nineteenth century by the mathematician and philosopher Bernard Bolzano. In fact, Bolzano made a very interesting discovery. Recorded in his Theory of Science (1837, section 161), his discovery was this. Suppose we partition the components of declarative sentences into fixed and variable terms (or fixed and variable 'ideas', as he called them), rather as we do today when we parse sentences into logical and extralogical components. Bolzano considered a partition according to such a logical/extralogical criterion, calling those sentences which remain uniformly true under variation of their extralogical components 'analytic', but in general he treats the partition merely as an exogenously given parameter. His criterion for the valid deduction of a conclusion C from finitely many premises A1, . . ., An, relative to any such partition, is that there is no way of interpreting the variable terms which results in all the Ai being jointly true and C false. We might paraphrase this in our current terminology as: C is validly inferred from A1, . . ., An, given the partition, if every model of the set {A1, . . ., An} is a model of C (where a model is just an interpretation making the sentences in question jointly true). If Bolzano had stopped there he might already have done enough to merit being regarded as the progenitor of modern model theory. But he went further, pointing out that, where the sentences in question admit only finite numbers of distinct models, we can consider not just the two limit cases (i) every model of {A1, . . ., An} is a model of C, and (ii) no model of {A1, . . ., An} is a model of C (in the second case C is of course inconsistent with those premises), but also the case where an intermediate proportion of the models of {A1, . . ., An} are also models of C; i.e., setting
Π = {A1, ..., An} we can consider a spectrum of different possible values of the quotient |Mod(Π) ∩ Mod(C)|/|Mod(Π)|, where |.| around a set signifies the number of its members, i.e. its cardinality, Mod(Σ) is the set of all models of a set Σ of sentences,6 and ∩ is set-intersection. Bolzano terms this, in remarkably modern language, the relation of relative satisfiability. This proportion will of course be a number between 0 and 1 including the two bounds, and it is not difficult to see that, where C and D have no models in common, the proportion of models of any set Σ of sentences which are also models of the disjunction ‘C or D’ is the sum of the proportion of models of Σ which are models of C and the proportion of models of Σ which are models of D. Bolzano proves this and many of the other results we now call theorems of the probability calculus.

We can illustrate Bolzano’s idea with a simple example. Suppose a die is to be thrown once, and let A be the sentence ‘The die comes to rest with a numeral between 1 and 6 uppermost’ (again we shall avoid tiresome pedantry by just writing n instead of the strictly correct ‘n’). Let C be ‘the outcome is of type Φ’ where Φ is some canonical way of describing the outcome of the throw. Bolzano did not have the flexible symbolic notation of modern formal logic with which to illustrate his ideas adequately, but we do, and at the cost of being charged with ahistoricity I shall use it now. Accordingly, in the die example we will employ a vocabulary whose ‘fixed terms’ are just the usual logical constants with their customary notation ∨ (‘or’), ¬ (‘not’), ∧ (‘and’), ∃ (‘some’), ∀ (‘all’), and whose ‘variable terms’ are a constant a (to be read ‘the throw of the die’), and six monadic predicates B1 (‘comes to rest 1 uppermost’), B2 (‘comes to rest 2 uppermost’), ..., and B6 (‘comes to rest 6 uppermost’). One of the few serious symbolic defects of modern classical logic is that the domain of the variables is never signalled by an explicit symbol, though it is just as ‘variable’ as the rest of the extralogical material. In this case the intended domain, or ‘universe of discourse’ as it is sometimes called, is the set consisting just of the single throw of the die. We can write A in this notation, together with multiple conjunction and disjunction operators, as

∀x(x = a) ∧ ∀x[ ⋁1≤i≤6 Bi(x) ∧ ⋀i≠j (Bi(x) → ¬Bj(x)) ]
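To make the quotient concrete, here is a minimal Python sketch (my illustration, not Bolzano's or the text's; all names in it are my own) that enumerates the six models of the die example and computes |Mod(Π) ∩ Mod(C)|/|Mod(Π)| for a sample conclusion:

    from fractions import Fraction

    # The six models of the die example: in model i the throw a satisfies
    # exactly the predicate B_i ('comes to rest i uppermost').
    MODELS = range(1, 7)

    def proportion(premises, conclusion):
        # Bolzano's quotient |Mod(premises) & Mod(conclusion)| / |Mod(premises)|;
        # sentences are represented as predicates over models.
        mod_premises = [m for m in MODELS if all(p(m) for p in premises)]
        mod_both = [m for m in mod_premises if conclusion(m)]
        return Fraction(len(mod_both), len(mod_premises))

    A = lambda m: m in MODELS      # true in every model: some numeral 1..6 uppermost
    C = lambda m: m % 2 == 0       # a sample conclusion: 'the outcome is even'

    print(proportion([A], C))      # 1/2: three of the six models of A are models of C

With premise set {A} and the conclusion ‘the outcome is even’, the quotient is 1/2, the intuitively right degree of partial entailment.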
In other words, exchangeability is a modelling assumption, appropriate for some situations, i.e. where one’s background knowledge entitles one to regard the position of individuals in an ordered sample as irrelevant, but not for others where there is believed to be dependency from trial to trial, as in Markov chains for example. Saying that some constraint is logical because it can be described in purely logico-mathematical terms therefore begs the question. So too, though possibly less obviously, do the range-measure probability functions defined in the way Bolzano prescribed. It is not just that the phenomenon of language-dependence does not arise in deductive logic, nor even that there are serious problems with infinity: there is also the fundamental question of why the different models of a set of sentences should be counted equally. No appeal to informational symmetry can answer this question, because that too begs the same question. Implicit in such measures is a substantive principle, the Principle of Indifference, no less substantive — and arbitrary — because it is often concealed in the formalism. It might be (and sometimes is) objected that in assigning different weights to the members of the possibility-space one would be equally, if not more, begging the question. What could justify such an a priori discrimination? The objection fails to take into account the fact that there will always be a priori discrimination because of the inevitable language-dependence associated with this measure. As we have seen, in the logical space determined by two individuals and one predicate P, but without individuating names, ‘P is instantiated twice’ has the same probability as ‘P is instantiated once’ (1/3), whereas if one adds the individuating names the sentences are ranked differently in probability (1/4 and 1/2 respectively); the sketch at the end of this section makes the contrast explicit. Objections like these have convinced many if not most commentators that the idea of an epistemically neutral logical probability metric which tells us how, for given data, we should evaluate the probabilities of contingent propositions is untenable: the constraints imposed in order that various consequences regarded as epistemic goods can be delivered beg the question. Without such constraints, however, we are left with the probability axioms and nothing more, and if this is logic then, so it would seem, it is too weak to be of any use whatever. Indeed, it would seem to be inevitably a subjective matter how, within those very liberal constraints, one chooses probability-values. One of the surprising, and at first sight paradoxical, findings to emerge in the aftermath of Carnap’s work was a recognition that indeterminacy of this type need not preclude a purely subjective theory of probability from being seen as an authentically logical probability. Carnap’s ‘logical probability’ has speciated into a variety of what have come to be called probability logics, mostly retaining the original Carnapian feature that probabilities are assigned to the sentences (and sometimes open formulas also) of an appropriate formal language, usually some extension of a first order language with equality. These developments fall into two broad classes, which in their different ways mirror developments in deductive logic itself, in particular its treatment of the modality ‘necessarily’ and its dual ‘possibly’. One aspect of necessary truth is automatically provided for in the model
theory of first and higher order logic, where it is identified with the category of logical truth. But in addition there are the modal logics which generalise first order logic by adding a new sentential operator to the first order vocabulary,18 where the usual formula-closure conditions mean that one can have arbitrary degrees of embedding of modal operators. The post-Carnapian treatments of logical probability reflect this bifurcation, with one generalising the model theory of classical logic and the other adding a probability-operator to the object language.19 The first of these ways of extending classical logic seems to connect more naturally with the developments in Bayesian probability, due to Ramsey and in particular de Finetti, whose central concern was the consistency of sets of probability assignments, and it is on that that I shall largely focus in the remaining sections.
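Before moving on, the language-dependence example above can be checked by brute enumeration. The sketch below is my own illustration, not part of the original discussion; the representation of the two logical spaces is an assumption of the illustration.

    from fractions import Fraction
    from itertools import product

    # With names: a state-description says of each of a, b whether it has P.
    named = list(product([False, True], repeat=2))      # 4 models
    # Without names: only the number of instances of P is distinguishable.
    unnamed = [0, 1, 2]                                 # 3 'structural' models

    def prob(space, event):
        hits = [m for m in space if event(m)]
        return Fraction(len(hits), len(space))

    twice = lambda k: k == 2
    once = lambda k: k == 1
    print(prob(unnamed, twice), prob(unnamed, once))    # 1/3 1/3
    print(prob(named, lambda m: twice(sum(m))),
          prob(named, lambda m: once(sum(m))))          # 1/4 1/2

The same sentences receive different probabilities in the two spaces, which is exactly the a priori discrimination complained of in the text.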
5 FROM LOGICAL PROBABILITY TO PROBABILISTIC LOGIC
I observed earlier that one can take either deductive consequence or deductive consistency as the primary deductive concept, since these are interdefinable. Bolzano, Keynes and Carnap saw in the formal properties of a conditional probability function a natural generalisation of the former. It was a seductive analogy, but in retrospect the wrong one. But a more promising way had already been pointed out, if implicitly, when in a memorable phrase Frege called the rules of classical deductive logic ‘laws of truth’ [1956]. After Tarski’s seminal work Frege’s characterisation could be refined to ‘the laws of truth-valuations’. Truth-valuations (with respect to any given structure) can be represented by functions from sentences in a formalised language into the set {0,1}, and a fundamental if logically trivial property of truth-valuations is that semantically equivalent sentences receive the same value. This means that factoring the class of sentences of the appropriate language by equivalence yields a Boolean algebra, and the valuations can therefore be viewed as defined on the algebra itself. We have the elements of a familiar structure for probabilists: a Boolean algebra on which a finitely additive measure is defined. It remains to take the decisive step and allow the measure to take arbitrary values in the entire unit interval. In addition, an almost trivial rewriting of classical first order logic shows that deductive consistency can be seen equivalently as a property of sets of truth-value assignments. Because in classical logic there are only two truth-values they don’t have to be, and usually are not, explicitly represented in the primitive vocabulary (negation ‘defines’ falsity). But in his book [1968] developing a very user-friendly version of semantic tableaux for sentences in ordinary first order languages, Raymond Smullyan also presented an elegant account of consistency and consequence for what he called ‘signed’ sentences, i.e. first order sentences to which can be adjoined the letters T and F. If A is such a sentence, AT and AF can
18 In a celebrated paper Montague showed that if the usual modal axioms are preserved together with the assumption that necessary sentences are true, then necessity cannot be an object-language predicate under any suitable codings, e.g. Gödel numbering, of sentences.
19 As for example does Halpern [1990].
of course be equivalently represented by A = 1 and A = 0.20 A set of signed propositional sentences is true in a model, i.e. a full Boolean valuation, just in case the model assigns the sentences those values, and the set is consistent just in case it has a model. This is of course highly suggestive: once consistency is seen to be a property of value-assignments to sentences, and hence to members of the sentence Boolean algebra of the language obtained by factoring by equivalence, the generalisation to a corresponding account of probabilistic consistency and consequence for value-assignments in more extensive number-domains, like [0,1], is obvious. Deductive consistency and probabilistic consistency can then be seen to be subspecies of a single species of consistency, the familiar notion of consistency as the solvability of equations subject to constraints, with the constraints being the differentiae: in the deductive case the clauses of a classical truth-definition, and in the probabilistic case the laws of probability.21 That said, signing propositions with arbitrary real numbers in [0, 1] is clearly a more complex affair than signing them with values in {0, 1}, a problem further complicated by the fact that one might want not just to specify values for selected sentences but assert that the values satisfy various algebraic relations specifiable within some suitable mathematical language/theory T. Another problem is that the typical structure investigated in mathematical probability is a σ-algebra with a countably additive probability function defined on it. A first order language L determines only an ordinary Boolean algebra (via the map from sentences A to either Mod(A), the class of models of A, or to the corresponding member |A| of the Lindenbaum sentence algebra of L), not closed under countable sup and inf. These issues were addressed in a paper by Dana Scott and Peter Krauss, published in 1966 and building on earlier work by Haim Gaifman [1964]. Firstly, they took as the language to which probability assignments are to be made an infinitary language of type Lω1,ω in the Lκ,λ family, where κ and λ are infinite cardinal numbers and Lκ,λ is a first order language closed under the formation of fewer than κ conjunctions and disjunctions, and strings of quantifiers of length less than λ. Their particular choice of first order basis language L contained, apart from identity, just one binary relation R (since all of mathematics can be expressed in first order set theory, and first order set theory has just one binary relation symbol for membership as its only extralogical vocabulary, this allows a degree of universality to the range of propositions that can be expressed).
20 For fairly obvious reasons no embedding is permitted: despite what looks like a blurring of the object-language/metalanguage distinction, this is a calculus of truth-value assignments to object-language sentences in a semantically open language.
21 Cf. de Finetti:
‘Just as the ordinary logic of two values is the necessary instrument of all reasoning where only the fact that an event happens or does not happen enters in, so the logic of the probable, the logic of a continuous scale of values, is the necessary instrument of all reasoning into which enters, visibly or concealed, a degree of doubt’ [1937, p.155].
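The idea of consistency as solvability under constraints can be made computational. The following sketch is my illustration only (the helper names and the use of scipy's linear-programming routine are my choices, not anything in the text): it tests whether an assignment of values in [0,1] to sentences is coherent, i.e. extendable to a finitely additive probability over the classical truth-valuations.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def coherent(sentences, values, atoms):
        # Is there a probability distribution over the 2^k truth-valuations of
        # the atoms that gives each sentence its assigned value?
        vals = [dict(zip(atoms, bits))
                for bits in itertools.product([False, True], repeat=len(atoms))]
        # One column per valuation; row i fixes the expected truth-value of sentence i.
        A_eq = np.array([[1.0 if s(v) else 0.0 for v in vals] for s in sentences])
        A_eq = np.vstack([A_eq, np.ones(len(vals))])   # the weights must sum to 1
        b_eq = np.array(list(values) + [1.0])
        res = linprog(np.zeros(len(vals)), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, 1)] * len(vals))
        return res.success

    p = lambda v: v['p']
    q = lambda v: v['q']
    both = lambda v: v['p'] and v['q']
    print(coherent([p, q, both], [0.5, 0.5, 0.4], ['p', 'q']))   # True
    print(coherent([p, q, both], [0.3, 0.3, 0.4], ['p', 'q']))   # False: P(p&q) > P(p)

Deductive consistency is the special case in which the assigned values are all 0s and 1s.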
Lω1,ω languages have of course much greater expressive power than first order languages.22 For example, the standard model of arithmetic can be defined by a single sentence. More relevant to the present discussion is the fact that the Lindenbaum algebra is isomorphic to a σ-field of sets, of the form Mod(ϕ) for sentences ϕ in Lω1,ω, and as a consequence many of the characteristic structures of mathematical probability theory are definable. Suppose for example that the sample space, call it C, is the set 2ω of all functions from ω (i.e. the natural numbers) into {0,1}, which in descriptive set theory is often identified with the real numbers. If the base first order language for Lω1,ω contains a denumerably infinite set of constants ai and a one-place predicate symbol B(x), the members of C can be identified with the structures (up to isomorphism) of the appropriate type in Mod(σ), where σ = …

… for any ε > 0, however small, if {Bi : i = 1, 2, 3, ...} is a disjoint family whose disjunction exhausts the space then for some n, P(B1 ∨ ... ∨ Bn) > 1 − ε. In particular, the distribution cannot be uniform. This is of course not true for either a finite set or an uncountably infinite interval of real numbers, as de Finetti observes. Indeed, suppose you have a uniform distribution over [0, 1], and then learn that the random quantity in question is a rational number in that interval. If countable additivity is assumed you are compelled to adopt a skewed distribution over the rationals under some enumeration. As de Finetti points out, no metrical consideration can justify this bias (the sketch at the end of this section illustrates the point numerically). The probability formalism itself produces it, in effect manufacturing information where there was none before [1974, vol. 1, pp.121-123] — contrary to the fundamental principle above. It might be objected that deductive logic itself supplies a counterexample to de Finetti’s principle, since ∃x(A(x) ∨ ¬A(x)), i.e. something exists, is a theorem of classical first order logic. That is true, and it is because it seems to contravene the principle that logic should be empty of content, and in particular not endorse unconditional existential assertions, that several people have proposed a variant,32
32 [1972, p.89].
called free logic, which is the closest one can get to first order logic without entailing the thesis that something necessarily exists. This is not the place to go further into that issue, but it is for much the same reason that most probabilists endorse the axiom of countable additivity — computational simplicity — that most logicians are reluctant to go over to free logic. But I think that de Finetti’s response would be that they are all merely sacrificing principle for ease, a judgment that in my opinion it is difficult to fault.
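As promised, a small numerical illustration (my own; the geometric weighting is an arbitrary choice) of why countable additivity forbids a uniform distribution over a countably infinite partition: the mass must pile up on some finite initial segment.

    # Any countably additive P over a countable disjoint family {B_i} exhausting
    # the space satisfies: for every eps > 0 there is an n with
    # P(B_1 v ... v B_n) > 1 - eps.  Illustration with P(B_i) = 2**-i.
    eps = 1e-6
    total, n = 0.0, 0
    while total <= 1 - eps:
        n += 1
        total += 2.0 ** -n
    print(n, total)   # n = 20 cells already carry more than 1 - 1e-6 of the mass

    # A uniform assignment P(B_i) = c is impossible: c = 0 sums to 0, while any
    # c > 0 makes the partial sums diverge.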
8 CONCLUSION
I have ended this discussion as I began it, with de Finetti. I believe he rightly deserves to be given such prominence because, despite the strongly decision-theoretic focus he gave his later work, he is so far the only person to provide, if at times implicitly, a defensible account of Bayesian probability as an authentic logic. His account was never systematically developed in one place or at one time — one has to pick up what appear to be sporadic obiter dicta scattered across his works and across decades. Nor was it couched in the terminology of modern logic. But if the account given in the preceding sections is correct then that is not very difficult to do, and in a way that exhibits deductive and probabilistic logic as parallel disciplines distinguished only in the nature of the constraints they impose on consistent evaluations. What more than anything else has militated against a recognition of the close kinship of the two disciplines (compare the quote from Kyburg in section 1) has been that they seem to be ‘about’ such very different things: internal relations between sentences versus numerical assignments to propositions, ‘laws of truth’ versus subjective beliefs, and so on. De Finetti, unlike nearly everyone else, did not fail to see the wood for the trees, with a clarity of vision evidenced early on in the title of his most famous paper: ‘Prevision: its logical laws, its subjective sources’ (my emphasis).

ACKNOWLEDGEMENTS

I am grateful for very helpful comments by an anonymous referee.

BIBLIOGRAPHY

[Adams, 1975] E. W. Adams. The Logic of Conditionals, Dordrecht: Reidel, 1975.
[Bayes, 1763] T. Bayes. An Essay Towards Solving a Problem in the Doctrine of Chances, Philosophical Transactions of the Royal Society of London, vol. 53, 370-418, 1763.
[Bolzano, 1837] B. Bolzano. Wissenschaftslehre. Versuch einer ausführlichen und grössentheils neuen Darstellung der Logik mit steter Rücksicht auf deren bisherige Bearbeiter, 1837. Page references to the English translation, Theory of Science. Attempt at a Detailed and in the main Novel Exposition of Logic with Constant Attention to Earlier Authors, by Rolf George (Oxford: Blackwell, 1972).
[Carnap, 1950] R. Carnap. Logical Foundations of Probability, Chicago: University of Chicago Press, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods, Chicago: University of Chicago Press, 1952.
[Carnap and Jeffrey, 1971] R. Carnap and R. C. Jeffrey, eds. Studies in Inductive Logic and Probability, volume 1, Berkeley: University of California Press, 1971.
[Coletti and Scozzafava, 2002] G. Coletti and R. Scozzafava. Probabilistic Logic in a Coherent Setting, Dordrecht: Kluwer, 2002.
[Cox, 1961] R. T. Cox. The Algebra of Probable Inference, Baltimore: The Johns Hopkins Press, 1961.
[Dale, 1982] A. I. Dale. Bayes or Laplace? An Examination of the Origin and Early Application of Bayes’s Theorem, Archive for History of Exact Sciences, vol. 27, 33-47, 1982.
[Dawid et al., 1973] A. P. Dawid, M. Stone, and J. V. Zidek. Marginalisation Paradoxes in Bayesian and Structural Inference, Journal of the Royal Statistical Society, B, 189-223, 1973.
[Fagin, 1976] R. Fagin. Probabilities on Finite Models, Journal of Symbolic Logic, vol. 41, 50-58, 1976.
[de Finetti, 1936] B. de Finetti. La logique de la probabilité, Actes du Congrès International de Philosophie Scientifique, vol. IV, 1-9, 1936.
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives, Paris: Institut Henri Poincaré, 1937. Page references to the English translation ‘Foresight: its Logical Laws, its Subjective Sources’, by H. E. Kyburg, in Studies in Subjective Probability, Second edition (1964), eds. H. E. Kyburg and H. Smokler, New York: Wiley.
[de Finetti, 1972] B. de Finetti. Probability, Induction and Statistics, New York: Wiley, 1972.
[de Finetti, 1974] B. de Finetti. Theory of Probability, vols. 1 and 2, 1974. (English translation of Teoria delle probabilità, Einaudi, 1970.)
[Fine, 1973] T. L. Fine. Theories of Probability, New York: Academic Press, 1973.
[Frege, 1956] G. Frege. The Thought: A Logical Enquiry, Mind, vol. 65, 289-311, 1956.
[Gaifman, 1964] H. Gaifman. Concerning Measures in First Order Calculi, Israel Journal of Mathematics, vol. 1, 1-18, 1964.
[Gaifman and Snir, 1982] H. Gaifman and M. Snir. Probabilities Over Rich Languages, Testing and Randomness, The Journal of Symbolic Logic, vol. 47, 495-548, 1982.
[Gillies, 2000] D. A. Gillies. Philosophical Theories of Probability, London: Routledge, 2000.
[Halpern, 1990] J. Y. Halpern. An Analysis of First-Order Logics of Probability, Artificial Intelligence, vol. 46, 311-350, 1990.
[Howson, 2000] C. Howson. Hume’s Problem: Induction and the Justification of Belief, Oxford: The Clarendon Press, 2000.
[Howson, 2001] C. Howson. The Logic of Bayesian Probability, in Foundations of Bayesianism, eds. D. Corfield and J. Williamson, Dordrecht: Kluwer, 137-161, 2001.
[Howson, 2008] C. Howson. De Finetti, Countable Additivity, Coherence and Consistency, British Journal for the Philosophy of Science, vol. 59, 1-23, 2008.
[Howson and Urbach, 2006] C. Howson and P. Urbach. Scientific Reasoning: the Bayesian Approach (3rd edition), Chicago: Open Court, 2006.
[Jaynes, 1973] E. T. Jaynes. The Well-Posed Problem, Foundations of Physics, vol. 3, 413-500, 1973.
[Jech, 1978] T. Jech. Set Theory, New York: Springer, 1978.
[Keynes, 1921] J. M. Keynes. A Treatise on Probability, London: Macmillan, 1921.
[Kolmogorov, 1933] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung, 1933. (Page references are to the English translation Foundations of the Theory of Probability, New York: Chelsea, 1951.)
[Kuhn, 1962] T. S. Kuhn. The Structure of Scientific Revolutions, Chicago: University of Chicago Press, 1962.
[Lewis, 1976] D. Lewis. Probabilities of Conditionals and Conditional Probabilities, Philosophical Review, LXXXV, 581-589, 1976.
[Mura, 2009] A. Mura. Probability and the Logic of de Finetti’s Trievents, in Bruno de Finetti: Radical Probabilist, ed. M. C. Galavotti, 201-242, 2009.
[Popper, 1959] K. R. Popper. The Logic of Scientific Discovery, New York: Harper & Row, 1959.
[Ramsey, 1926] F. P. Ramsey. Truth and Probability, in The Foundations of Mathematics, ed. R. B. Braithwaite, London: Kegan Paul, Trench, Trubner, 156-199, 1926.
[Rényi, 1955] A. Rényi. On a New Axiomatic Theory of Probability, Acta Mathematica Academiae Scientiarum Hungaricae, vol. VI, 285-335, 1955.
[Savage, 1954] L. J. Savage. The Foundations of Statistics, New York: Dover, 1954.
[Scott and Krauss, 1966] D. Scott and P. Krauss. Assigning Probabilities to Logical Formulas, in Aspects of Inductive Logic, eds. J. Hintikka and P. Suppes, Amsterdam: North Holland, 219-264, 1966.
[Seidenfeld, 1979] T. Seidenfeld. Why I am not an Objective Bayesian: Some Reflections Prompted by Rosenkrantz, Theory and Decision, vol. 11, 413-440, 1979.
[Smullyan, 1968] R. Smullyan. First Order Logic, New York: Dover, 1968.
[Tarski, 1951] A. Tarski. A Decision Method for Elementary Algebra and Geometry, Berkeley: University of California Press, 1951.
BAYESIAN INDUCTIVE LOGIC, VERISIMILITUDE, AND STATISTICS

Roberto Festa

Below we will consider the relations between inductive logic and statistics. More specifically, we will show that some concepts and methods of inductive logic may be applied in the rational reconstruction of several statistical notions and procedures and that, in addition, inductive logic suggests some new methods which can be used for different kinds of statistical inference. Although there are several approaches to inductive logic and statistics, here we will focus on some versions of the Bayesian approach and, thereby, on the relations between Bayesian inductive logic and Bayesian statistics. The paper is organized as follows. The subjects of inductive logic and statistics will be briefly illustrated in Section 1, where it will be suggested that statistics can be seen as a special field of inductive logic. Two important theories developed within Bayesian inductive logic are the theory of inductive probabilities, started by Rudolf Carnap in the forties of the past century, and the theory of confirmation: the conceptual relations between such theories and statistics will be considered in Section 2. A recent version of Bayesian inductive logic, proposed by Ilkka Niiniluoto and others, has been developed by using the notion of verisimilitude, introduced in philosophy of science by Karl Popper; the key ideas of the verisimilitudinarian version of Bayesian inductive logic will be illustrated in Section 3, where it will be argued that it provides useful conceptual tools for the analysis of some important kinds of statistical inference.
1 BAYESIAN INDUCTIVE LOGIC AND BAYESIAN STATISTICS

1.1 Inductive logic and statistics
Empirical sciences make use of different kinds of inferences leading from a set of one or more statements called premises to another statement, called conclusion. According to a traditional distinction inferences are divided into deductive and inductive. While in deductive inferences the information conveyed by the conclusion is already included in the premises, in inductive inferences the conclusion says something extra with respect to the premises. Hence, even though the truth of the premises is taken as guaranteed, there is the inevitable risk that the extra information conveyed by the conclusion of an inductive inference may be false. This means that the conclusion of an inductive inference is inevitably uncertain. The systematic analysis of inductive inferences is the subject matter of inductive
474
Roberto Festa
logic. Suppose that, in an inquiry on the hypothesis H, evidence is expressed by a sentence E, where H cannot be deduced from E. In this case we might ask what the inductive relations between E and H are. A good inductive logic should provide, among other things, a satisfying analysis of the inductive relations between evidence and hypotheses. In particular, it should answer the following questions about the plausibility, confirmation, and acceptability of H in the light of E:

(1) What is the degree of plausibility of H in the light of E?

(2) What is the degree of confirmation conveyed by E to H, i.e., what is the increase of the initial plausibility of H determined by E?

(3) Given evidence E, should we accept H, in some suitably specified sense of acceptance?

The inductive inferences involved in questions (1)–(3) may be labelled as plausibilistic, confirmational, and acceptance-based inferences, respectively. A statistical inference is a particular type of inductive inference where the evidence is (the description of) a sample taken from a given population and the hypothesis concerns the whole population or certain samples to be drawn from it. More generally, a statistical inference is an inductive inference where the evidence is (the description of) a sequence of trials of a given experimental process and the hypothesis concerns certain parameters of the process or certain future sequences of trials. The main goal of statistics is the formulation of a good theory of statistical inferences, i.e., of a systematic corpus of procedures applicable to any kind of statistical inference.

The above characterization of inductive logic and statistics suggests that statistics is a province of the wide continent of inductive logic. Indeed, if inductive logic is construed as the study of inductive reasoning, wherever and however it occurs, then an ideal inductive logician looks for stable patterns underlying the various forms and areas of inductive reasoning. Hence an ideal inductive logician looks, among other things, for stable patterns underlying statistical reasoning in the empirical sciences. This means that the subject matter of statistics belongs to inductive logic, i.e., that the basic relation between the two disciplines is the very close relation occurring between a whole and a part of it. However, in spite of this intimate relation, starting at least from the beginning of the past century, inductive logic and statistics have undergone a separate development. While inductive logic has been developed within philosophy, especially within epistemology and philosophy of science, statistics has grown impetuously as a highly specialized discipline, with its own departments and faculties, in constant interaction with scientific practice, including the more sophisticated outcomes of probability theory and, more generally, of mathematical research. This separate growth of inductive logic and statistics has led, among other things, to the development of different languages, conceptual systems, and formal tools, so that it is difficult to recognize that in many cases the two disciplines deal with the same problems, and sometimes propose essentially identical solutions. Fortunately, in the last few decades there have been some signs of an opposite tendency, i.e., of an increasing awareness of the close conceptual relations and the potentially fruitful interactions between inductive logic and statistics.
1.2 The Bayesian approach to inductive logic and statistics
The most popular approach to inductive logic is the Bayesian approach. The basic idea of Bayesian inductive logic (for short: BIL) is that any kind of inductive inference should be analysed in terms of inductive — or, equivalently, epistemic — probabilities. This means that any inductive inference from evidence E to the hypothesis H amounts to — or, at least, is based on — the attribution of the epistemic probability p(H|E) to H, where p(H|E) expresses the degree of belief that a rational agent X should attribute to H in the light of E. Within the different versions of BIL this basic idea has been applied in the analysis of plausibilistic, confirmational, and acceptance-based inferences. For instance, the plausibility of H in the light of E is usually identified with p(H|E) and the confirmation conveyed by E to H is usually identified with some measure of the probability increase in the shift from the initial probability p(H) of H to its final probability p(H|E). Different versions of BIL may be distinguished on the basis of the methodological importance attributed to the different kinds of inductive inference. For instance, most supporters of BIL advocate a non-acceptance-based approach, according to which acceptance-based inferences are dispensable both in science and everyday life, where we need only plausibilistic and confirmational inferences. In conflict with this view, some Bayesian inductive logicians defend an acceptance-based approach which emphasizes the methodological indispensability of acceptance-based inferences. Within this approach two basic kinds of acceptance rules have been proposed: (i) purely probabilistic acceptance rules, where the acceptance of a hypothesis H depends only on its probability; (ii) decision-based acceptance rules, where the acceptance of H is seen as a cognitive decision depending not only on the probability of H but also on the (suitably defined) cognitive utilities associated with the acceptance of H in the possible states of nature: for instance, the acceptance rules stated within the so-called cognitive decision theories prescribe accepting the hypothesis which maximizes the expected cognitive utility [Levi, 1967; 1980; Niiniluoto, 1987; Festa, 1999b; 2003]. While in the last century Bayesianism has been the dominant view among inductive logicians, the orthodoxy among statisticians has been anti-Bayesian. Indeed, the so-called orthodox, or frequentist, statistics developed in the twentieth century — on the track of Jerzy Neyman, Egon S. Pearson, and Ronald A. Fisher [1925; 1935; 1950; 1956] — is based on the explicit rejection of the Bayesian idea that statistical inferences can be made by attributing epistemic probabilities to statistical hypotheses. In conflict with this idea, orthodox statisticians maintain that statistical frequencies are the only probabilities to be used in statistical inferences. However, starting from the second half of the twentieth century, the Bayesian approach to statistics has undergone — thanks to the work of Bruno de Finetti [1937/1964; 1974], Harold Jeffreys [1931; 1937], Leonard J. Savage [1954], and many others — an impressive development, so that nowadays Bayesian statistics (for short: BS) represents a serious alternative to frequentist statistics.
2 THEORY OF INDUCTIVE PROBABILITIES, CONFIRMATION, AND STATISTICS

2.1 Inductive methods for multinomial inferences
Most of the work on BIL made by Rudolf Carnap between the beginning of the forties and his death, in 1970, was directed to the development of a theory of epistemic probabilities for the analysis of a well known kind of statistical inference, i.e., multinomial inferences. His theory is usually known under the label of theory of inductive probabilities (for short: TIP). Below we will briefly illustrate the notion of inductive method for multinomial inferences — for short: multinomial method — and the different strategies used within BS and TIP for the construction of such methods. Afterwards, we will consider some (non-analogical and analogical) multinomial methods introduced within TIP and BS and we will show that, in spite of their different conceptual bases, the methods worked out within TIP are essentially identical to those used within BS.
2.1.1 Multinomial methods within TIP and BS
Consider an experimental process Ex whose trials are described by using a qualitative character, or property, Q = {Q1, ..., Qk}, where the k (> 2) categories Q1, ..., Qk are mutually exclusive and jointly exhaustive. We say that Ex is a multinomial process in the case where, for any Qi ∈ Q, there is a physical probability qi that the outcome of an arbitrary trial of Ex is Qi, where qi is constant across trials — influenced neither by the trial number nor by the outcomes of previous trials. The background knowledge available in many empirical inquiries includes the assumption that the investigated process Ex is a multinomial process. However, typically a researcher does not know the true value of the parameter vector q = (q1, ..., qk) of Ex, but can only formulate different kinds of hypotheses about q and the results of future trials of Ex. The inductive inferences concerning such hypotheses will be called multinomial inferences. More precisely, a multinomial inference about Ex amounts to the determination of the epistemic probability p(H|E), where each of the sentences H and E may concern either the value of q or the result of some sequence of trials of Ex. A multinomial method is an inductive procedure allowing one to make any kind of multinomial inference about any multinomial process Ex. Within BS the starting point for the definition of a multinomial method is given by a suitable representation of the researcher’s initial opinions about the value of the parameter vector q of the investigated multinomial process Ex, i.e., by an appropriate prior distribution F(q) of epistemic probabilities on the possible values of q. A different strategy is adopted within TIP, where the starting point for the construction of a multinomial method is given by the definition of an inductive rule R specifying, for any possible sequence en of outcomes of n trials of Ex and any category Qi ∈ Q, the predictive probability p(Qi |en) of the hypothesis that the result of the next trial of Ex will belong to Qi. Then the predictive
probabilities p(Qi |en) may be used to determine the prior probability p(en) of any sequence en of outcomes of n future trials of Ex, the posterior probability p(en |em), where em describes a sequence of outcomes of m past trials of Ex, and any other epistemic probability p(H|E) associated with Ex. In Carnap’s terminology predictive probabilities p(Qi |en) are referred to as the special values of a multinomial method, where “special” refers to their special importance in the definition of a multinomial method. The conceptual differences between the strategies for the construction of multinomial methods adopted within BS and TIP can be construed as the differences between a globalistic and a predictivistic approach to multinomial inferences. Indeed, the strategy adopted within BS is based on the attribution of appropriate epistemic probabilities to the possible values of the parameter vector q of the multinomial process Ex, which is, as it were, a global feature of Ex. On the contrary, the strategy adopted within TIP is based on the attribution of appropriate epistemic probabilities to any possible prediction concerning a future trial of Ex in the light of any possible sequence of past trials of Ex. Although this conceptual difference between TIP and BS cannot be neglected, it should be recalled that there is also an important conceptual equivalence between TIP and BS. We refer to the circumstance that an important class of multinomial methods investigated within TIP, i.e., the exchangeable multinomial methods, is essentially identical to the multinomial methods considered within BS. Let ni denote the frequency of Qi in a sequence en of trials of Ex, i.e., the number of outcomes in en belonging to the category Qi. Then a multinomial method for Ex is called exchangeable in the case where, for any n and any en, the prior probability p(en) depends only on n1, ..., nk — and not on the order in which the outcomes Q1, ..., Qk occur in en (the sketch at the end of this subsection checks this property numerically for a simple inductive rule). De Finetti’s representation theorem implies that any exchangeable multinomial method I for Ex is essentially identical to a corresponding (unique) prior distribution FI(q) defined on the parameter vector q of Ex [de Finetti, 1937/1964; 1974]. This means that the exchangeable multinomial methods worked out within TIP are essentially identical to the multinomial methods considered within BS.
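As flagged above, exchangeability can be checked numerically. The sketch below is my own illustration; Laplace's rule of succession is used as the inductive rule, an arbitrary but standard special case.

    from fractions import Fraction
    from itertools import permutations

    K = 3   # number of categories Q1, Q2, Q3

    def predictive(counts, i, n):
        # An illustrative inductive rule (Laplace): p(Q_i | e_n) = (n_i + 1)/(n + K).
        return Fraction(counts[i] + 1, n + K)

    def prior_prob(seq):
        # p(e_n) obtained from the predictive probabilities by the chain rule.
        counts, p = [0] * K, Fraction(1)
        for n, i in enumerate(seq):
            p *= predictive(counts, i, n)
            counts[i] += 1
        return p

    seq = (0, 1, 1, 2, 0)
    print({prior_prob(perm) for perm in set(permutations(seq))})   # one single value

Every reordering of the sample receives the same prior probability, which is exactly the exchangeability property just defined.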
2.1.2 The equivalence between GC-methods and Dirichlet distributions
A well known family of exchangeable multinomial methods is given by generalized Carnapian methods (for short: GC-methods), which are characterized by the following inductive rule [Carnap and Stegmüller, 1959]:
(GC-rule)   p(Qi | en) = (ni + λγi) / (n + λ)
where γi > 0, Σγi = 1, and 0 ≤ λ ≤ ∞. The multinomial methods used within BS are typically based on the so-called Dirichlet distributions, which can be defined on the parameter vector q of any multinomial process Ex. It can be easily proved that — as a consequence of de
Finetti’s theorem — any GC-method is equivalent to a corresponding Dirichlet distribution. It may be interesting to point out that, in spite of its relative triviality, the equivalence between GC-methods and Dirichlet distributions is not mentioned in Carnap’s papers and, more generally, that it passed unnoticed until 1965 [Good, 1965; Festa, 1993]. This circumstance clearly reveals the almost surprising lack of interaction between TIP and BS in the forties and fifties of the past century. However, starting from the seventies, the conceptual relations between TIP and BS have been increasingly understood.
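The equivalence can also be checked numerically. The sketch below is my own illustration: the parameter values are arbitrary, and the conjugacy of Dirichlet priors for multinomial sampling is assumed. It compares the GC-rule with the posterior mean of the corresponding Dirichlet distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    gamma = np.array([0.2, 0.3, 0.5])   # the gamma_i of the GC-rule
    lam = 4.0                           # the lambda parameter
    counts = np.array([6, 1, 3])        # observed frequencies n_1, n_2, n_3
    n = counts.sum()

    # GC-rule: p(Q_i | e_n) = (n_i + lam*gamma_i) / (n + lam)
    gc = (counts + lam * gamma) / (n + lam)

    # Dirichlet route: the prior Dirichlet(lam*gamma) has posterior
    # Dirichlet(counts + lam*gamma), whose mean gives the predictive probabilities.
    draws = rng.dirichlet(counts + lam * gamma, size=200_000)
    print(gc)                  # exact GC-rule values
    print(draws.mean(axis=0))  # Monte Carlo posterior means, equal up to sampling error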
2.1.3 The axiomatization of GC-methods and Dirichlet distributions
Between the fifties and the seventies of the past century, inductive logicians and Bayesian statisticians succeeded in axiomatizing GC-methods and Dirichlet distributions. Axiomatizing a class of inductive methods amounts to proving that the class includes all and only those methods which satisfy a given set of axioms. In particular, GC-methods have been axiomatized by several authors [Carnap and Stegmüller, 1959; Carnap, 1980, § 19; Kuipers, 1978, Ch. 5]. Their results may be summarized as follows:

(T.1) An exchangeable multinomial method I for a family Q = {Q1, ..., Qk} of k categories, where k > 2, is a GC-method in the case where the predictive probabilities p(Qi |en) determined on the basis of I satisfy the following principles:

(IP) Principle of initial possibility. p(Qi) ≡ γi > 0
(RR) Principle of restricted relevance. If en and e′n are such that ni = n′i, then p(Qi |en) = p(Qi |e′n).

Recalling that Dirichlet distributions are prior distributions defined on the parameter vector q of a multinomial process Ex, one may ask whether Dirichlet distributions can be axiomatized on the basis of appropriate principles expressing plausible constraints on the way in which we assign our prior probabilities to the possible values of q. We will see below that a positive answer to this question can be given. Axiomatizing a class of prior distributions amounts to proving that the class includes all and only those distributions which satisfy a given set of axioms. In particular, Dirichlet distributions have been axiomatized by Fabius [1973]. His axiomatization is based on the concept of neutrality. Put simply, given the proportions q1, ..., qk — which are defined as non-negative parameters satisfying the constraint Σqi = 1 — the proportion q1 is said to be neutral w.r.t. q = (q1, ..., qk) when no information on the value of q1 can influence our beliefs concerning “the manner in which the remaining proportions q2, ..., qk proportionally divide the remainder of the unit interval” [Connor and Mosimann, 1969, p. 196]. Fabius’s axiomatization of Dirichlet distributions (ibid.) can be stated as follows:
(T.2) A prior distribution F(q) on the parameter vector q = (q1, ..., qk) of a multinomial process Ex is a Dirichlet distribution in the case where F(q) satisfies the following principle:

(N) Principle of neutrality. F(q) is such that any qi is neutral w.r.t. q.
An important difference between the axiomatizations of GC-methods and Dirichlet distributions should be noticed. The principles (IP) and (RR), used in the axiomatization of GC-methods, are predictive principles concerning the behaviour of the inductive rule used to determine the predictive probabilities p(Qi |en); on the contrary, the principle (N), used in the axiomatization of Dirichlet distributions, is a global principle concerning our opinions about the global feature q of Ex, i.e., the probabilities attributed to the possible values of q. Given the essential identity between GC-methods and Dirichlet distributions, we might say that each of the theorems (T.1) and (T.2) provides an axiomatic foundation of both GC-methods and Dirichlet distributions. However, while the axiomatization stated in (T.1) provides a predictivistic justification of GC-methods and Dirichlet distributions, the axiomatization stated in (T.2) provides a globalistic justification of such methods.1
2.1.4 The non-analogical nature of GC-methods and Dirichlet distributions
The principles (RR) and (N), used in the axiomatizations of GC-methods and Dirichlet distributions, immediately reveal the non-analogical nature of such multinomial methods. Indeed, the principle (RR), used in the axiomatization of GC-methods, implies that the predictive probabilities p(Qi |en) of a GC-method depend on the empirical evidence en only via n and ni. In other words, (RR) implies that p(Qi |en) should in no way be affected by the empirical frequencies nj — with j ≠ i — in en; this independently of the possible, more or less strong, similarity occurring between categories Qj and Qi. This means that GC-methods cannot take into account any analogy by similarity, i.e., that GC-methods are not analogical methods. Similar anti-analogical intuitions seem to underlie the principle (N), used in the axiomatization of Dirichlet distributions. Indeed (N) excludes any analogy by similarity by requiring that our beliefs about qi should in no way be affected by information concerning the value of the fraction qj/(1 − qi) — with j ≠ i — in the remainder (1 − qi) of the unit interval; this independently of the possible, more or less strong, similarity occurring between categories Qj and Qi.
2.1.5 Analogical methods for one property
Although most of Carnap’s work on TIP was devoted to the investigation of GC-methods and other non-analogical multinomial methods, the construction of
1 On the axiomatic foundation of GC-methods and Dirichlet distributions — and the justification of the use of such multinomial methods in empirical inquiries — see [Festa, 1993, Ch. 6 and 7.6].
analogical inductive methods applicable to multicategorical experiments — i.e., to experiments whose outcomes are described by one or more categories — has always been on Carnap’s agenda. Let us consider, first of all, the problem of defining appropriate analogical methods for one property. Such methods may be introduced w.r.t. any kind of experimental process described by a property Q = {Q1, ..., Qk} with k > 2. This means that Ex might be a multinomial process, but it might also be a different kind of physical process, or a process whose nature is not exactly known. Given three categories Qi, Qj, and Qk belonging to Q, suppose that the similarity between Qj and Qi is higher than that between Qk and Qi. If our predictive probabilities are determined by using a GC-method, then the equality p(Qi |Qj) = p(Qi |Qk) holds; more generally, for any sequence en of trials of Ex, the equality p(Qi |en Qj) = p(Qi |en Qk) holds. This means that, if a GC-method is used, then the predictive probabilities are not affected by the similarity relations occurring among Qi, Qj, and Qk (a small numerical check of this is sketched at the end of this subsection). However, in certain cases we might require that our inductive method is sensitive to such similarity relations; for instance, we might require that it satisfies the following weak principle of analogy by similarity:

(WAS) For any triple Qi, Qj, and Qk such that Qj is more similar than Qk to Qi, p(Qi |Qj) > p(Qi |Qk).

In some cases we might also require that our inductive method satisfies the following strong principle of analogy by similarity:

(SAS) For any sequence en of trials of Ex and any triple Qi, Qj, and Qk such that Qj is more similar than Qk to Qi, p(Qi |en Qj) > p(Qi |en Qk).

We will say that an inductive method is a weakly analogical method in the case where it satisfies (WAS) or akin weak principles of analogy by similarity, while it is a strongly analogical method in the case where it satisfies (SAS) or akin strong principles of analogy by similarity. Within TIP the problem of defining appropriate analogical methods for one property has been investigated by Carnap [1963; 1980], Niiniluoto [1981], Costantini [1983], Kuipers [1984a; 1984b; 1988], Skyrms [1993a], Festa [1996], and others. The methods proposed by the authors just mentioned can be classified as follows:

1. Strongly analogical non-exchangeable methods. The analogical methods proposed by Carnap [1980], Niiniluoto [1981], Costantini [1983], and Kuipers [1984a; 1984b; 1988] are of this kind. Their methods are introduced in the TIP-style, i.e., on the basis of simple and intuitively transparent inductive rules specifying the predictive probabilities p(Qi |en). Such inductive rules are explicitly designed to guarantee the satisfaction of some strong principles of analogy by similarity, such as (SAS) (Carnap), or akin principles (Niiniluoto and Kuipers). Unfortunately, the above methods also exhibit three related and not very attractive features: (i) they are not exchangeable; (ii) thereby, they are not multinomial methods; indeed, the adoption of
a non-exchangeable method implies that the investigated process Ex is not multinomial, since any prior distribution F(q) on the parameter vector q of a multinomial process is essentially identical to an exchangeable method; (iii) finally, it is not clear whether the adoption of such methods imposes some constraint on the (non-multinomial) physical structure of Ex.

2. Weakly analogical exchangeable methods. A method of this kind has been proposed by Skyrms [1993a]. This method — which will be called Sk — is introduced in the BS-style, i.e., as a prior distribution FSk(q) on the parameter vector q of a particular multinomial process Ex, given by a wheel of fortune whose possible outcomes are described by the property Q = {North, East, South, West}. More precisely, FSk(q) is given by a mixture of four Dirichlet distributions defined on q. This implies that Sk is exchangeable, i.e., that Sk is a multinomial method. One can easily check that Sk is weakly analogical since it satisfies (WAS). However, none of the strong principles of analogy so far proposed in the literature is satisfied by Sk [Festa, 1996]. Finally, it is not clear whether Sk can be generalized so as to obtain a class of strongly analogical methods applicable to different kinds of multinomial process.

3. Strongly analogical exchangeable methods. A method of this kind has been proposed by Festa [1996]. Festa’s method too — which will be called Fe — is introduced in the BS-style, i.e., as a prior distribution FFe(q) on the parameter vector q of a particular multinomial process Ex, described by the property Q = {Q1, Q2, Q3} with three categories. More precisely, FFe(q) is given by a mixture of two Dirichlet distributions defined on q. This implies that Fe is exchangeable, i.e., that it is a multinomial method. Moreover, it can be proved that Fe is strongly analogical since it satisfies (SAS). However, together with such attractive features, Fe shares with Sk an unpleasant feature; indeed, it is not clear whether and how Fe can be generalized to different kinds of multinomial process.

The research on analogical methods for one property seems to be an important field for the cooperative development of TIP and BS: indeed, some of the above-mentioned analogical methods have been developed by using the conceptual tools of both TIP and BS. However, one has to recognize that so far the research on analogical methods for one property has not led to outcomes comparable with those of the research on non-analogical methods and, in particular, on GC-methods and Dirichlet distributions. Among other things, it should be recalled that, so far, (i) no general class of exchangeable analogical methods has been proposed, and (ii) no axiomatic foundation of exchangeable analogical methods has been provided.
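Here is the numerical check announced above (my own; a symmetric γ and a small λ are arbitrary choices) that a GC-method is blind to similarity: the predictive probability of Q1 is the same whether the single observed outcome was Q2 or Q3.

    from fractions import Fraction

    K = 3
    gamma = [Fraction(1, K)] * K     # symmetric gamma_i
    lam = 2                          # an arbitrary lambda

    def gc_predictive(counts, i):
        # GC-rule: p(Q_i | e_n) = (n_i + lam*gamma_i) / (n + lam)
        return (counts[i] + lam * gamma[i]) / (sum(counts) + lam)

    after_Q2 = gc_predictive([0, 1, 0], 0)   # one observation of Q2
    after_Q3 = gc_predictive([0, 0, 1], 0)   # one observation of Q3
    print(after_Q2, after_Q3)                # 2/9 2/9: equal, whatever the similarities

So (WAS), let alone (SAS), fails for every GC-method, however the categories resemble one another.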
2.1.6 Analogical methods for many properties
Another interesting problem considered within TIP is the definition of appropriate analogical methods for many properties. Such methods have been investigated with
reference to multinomial processes whose possible outcomes are cross classified on the basis of two or more properties [Carnap, 1963; Hesse, 1964; Maher, 2000; 2001; Romeyn, 2005; 2006]. In the simplest case, the outcomes are described by using two properties F = {F1, F2} and G = {G1, G2}, where F2 ≡ non-F1 and G2 ≡ non-G1. Given two trials a and b of a multicategorical experiment Ex, we might require that our predictive probabilities satisfy the following analogical principle:

(A) p(G1b | F2G1a & F1b) > p(G1b | F2G2a & F1b)

Suppose, for instance, that Ex is given by the random drawing of an individual from the population of swans and that the members of such population are described by the properties F and G, where F1 stands for “Australian” and G1 for “white”. Then a and b will refer to individuals randomly drawn from the population of swans. With reference to this example, the intuitive meaning of (A) can be expressed as follows: the probability that the next observed Australian swan is white, given that a previously observed non-Australian swan is white, is greater than the probability of that hypothesis given that a previously observed non-Australian swan is non-white. If an inductive method satisfies (A) or other akin analogical principles, we will say that it is an analogical method for many properties. Several analogical methods for many properties have been introduced in the literature. While the investigation of analogical methods for many properties was initially carried out using only the conceptual tools of TIP [Carnap, 1963; Hesse, 1964; Niiniluoto, 1980; 1988], the research on this subject made in the last decade is characterized by the extensive use also of the conceptual tools of BS [Maher, 2000; 2001; Romeyn, 2005; 2006]. In particular, such research has shown that analogical methods for many properties have close — and still not fully explored — relations with the statistical research on contingency tables [Albert and Gupta, 1983; Epstein and Feinberg, 1992; Good, 1965; 1983; Lindley, 1964].2
2.2 Inductive methods for non-multinomial statistical inferences
While the research on analogical methods has not led to outcomes comparable with those of the research on GC-methods and Dirichlet distributions, a much more successful story can be told about the research made on two other issues which were on Carnap’s agenda. We are referring to the construction of inductive methods for a value continuum and of inductive methods sensitive to analogy by proximity. Although there is no room here for this successful story, the reader can find a lucid survey of the research in this area in Skyrms [1996].3 Among other things, Skyrms clearly shows that such research is characterized by a deep integration of TIP and BS methods. For instance, within three years of Carnap’s
2 For an explicit reference to the relation between Carnap’s analogical methods for two properties and the problem of estimating cell probabilities in the case of a two-dimensional contingency table, see [Costantini and Garibaldi, 1996].
3 See also [Skyrms, 1991; 1993b].
death, the problem of the construction of inductive methods for a value continuum was “completely solved in a way quite consonant with Carnapian techniques” [Skyrms, ibid., p. 321] by three Bayesian statisticians, i.e., Blackwell and MacQueen [1973] and Ferguson [1973].
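To convey the flavour of that solution, here is a minimal sketch (my own; the function and parameter names are hypothetical) of the Blackwell-MacQueen urn scheme, the sequential sampling rule associated with Ferguson's Dirichlet process, which plays the role of a Carnapian predictive rule over a continuum of values:

    import random

    def blackwell_macqueen(n, alpha, base_draw, seed=0):
        # X_{m+1} repeats an earlier value with probability proportional to its
        # frequency, and is a fresh draw from the base distribution with
        # probability alpha/(alpha + m).
        rng = random.Random(seed)
        values = []
        for m in range(n):
            if rng.random() < alpha / (alpha + m):
                values.append(base_draw(rng))      # a new value from the continuum
            else:
                values.append(rng.choice(values))  # repeat a previously seen value
        return values

    sample = blackwell_macqueen(20, alpha=1.0,
                                base_draw=lambda r: round(r.uniform(0, 1), 3))
    print(sample)   # exchangeable draws; repeats mark spontaneously formed 'categories'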
2.3 Confirmation and statistics
As said in Section 1.2, the confirmation conveyed by evidence E to a hypothesis H is usually identified with some measure of the probability increase in the shift from the initial probability p(H) of H to its final probability p(H|E). Two popular incremental measures of confirmation are the probability difference cd(H, E) ≡ p(H|E) − p(H) and the probability ratio cr(H, E) ≡ p(H|E)/p(H). Another popular incremental measure is defined in terms of the initial and final odds of H, i.e., in terms of o(H) ≡ p(H)/(1 − p(H)) and o(H|E) ≡ p(H|E)/(1 − p(H|E)). We are referring to the so-called odds ratio cor(H, E) ≡ o(H|E)/o(H). An attractive feature of cor(H, E) is given by the easily proved equality cor(H, E) = p(E|H)/p(E|¬H). The quantity p(E|H)/p(E|¬H) is commonly known as the Bayes factor (in favour of H). It should be noticed that, while cor(H, E) is well defined only in the case where p(H) is positive, in principle the Bayes factor p(E|H)/p(E|¬H) may be well defined also in the case where p(H) is zero. More generally, one immediately sees that the initial probability of H is not necessary for the determination of the Bayes factor in favour of H. This feature of the Bayes factor explains why it has been considered an attractive measure of confirmation for statistical hypotheses, also by a number of non-Bayesian statisticians, who would scarcely be inclined to adopt confirmation measures stated in terms of the initial or final probabilities of statistical hypotheses. Although the research on the confirmation of scientific hypotheses has been carried out mainly within the general framework of inductive logic [Festa, 1999a; 2009; Festa et al., 2010; Fitelson, 1999], in the last few decades the concept of confirmation — or, equivalently, empirical support — has attracted increasing attention among statisticians. For instance, several frequentist statisticians have suggested that the so-called p-values of statistical hypotheses can be construed as an appropriate measure of their degree of empirical support. The first Bayesian statistician who has devoted a lot of attention to the confirmation measures suggested within inductive logic — and to the possibility of applying such measures to statistical hypotheses — is I. J. Good. Among other things, Good provides a thorough analysis of the Bayes factor and suggests a Bayesian rational reconstruction of the measures of corroboration proposed by Karl Popper as an alternative to Bayesian measures of confirmation.4
4 See especially [Good, 1950; 1960; 1960/1968; 1961; 1968; 1975; 1980; 1981a; 1981b; 1982; 1983; 1984; 1985a; 1985b; 1989a; 1989b; 1989c; 1991; 1997] and [Good and McMichael, 1984].
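The measures just surveyed are easy to compute. The sketch below (my own; the numbers are purely illustrative) derives them from p(H), p(E|H) and p(E|¬H), and exhibits the equality of the odds ratio with the Bayes factor:

    def confirmation_measures(p_h, p_e_h, p_e_not_h):
        # Incremental measures computed from p(H), p(E|H), p(E|~H).
        p_e = p_h * p_e_h + (1 - p_h) * p_e_not_h
        p_h_e = p_h * p_e_h / p_e                  # Bayes' theorem
        odds = lambda p: p / (1 - p)
        return {
            'c_d': p_h_e - p_h,                    # probability difference
            'c_r': p_h_e / p_h,                    # probability ratio
            'c_or': odds(p_h_e) / odds(p_h),       # odds ratio
            'bayes_factor': p_e_h / p_e_not_h,     # needs no prior for H
        }

    print(confirmation_measures(p_h=0.3, p_e_h=0.8, p_e_not_h=0.2))
    # c_or and bayes_factor both equal 4.0, as the equality in the text asserts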
3 VERISIMILITUDE AND STATISTICS

3.1 Verisimilitude and inductive logic: towards a verisimilitudinarian Bayesian inductive logic
A recent version of BIL, which may be called verisimilitudinarian BIL, is based on the integration, within the conceptual framework of BIL, of the notion of verisimilitude. The first formal notion of verisimilitude — or approximation to the truth — was worked out by Karl Popper within his falsificationist and anti-inductivist philosophy of science [Popper, 1963]. Afterwards this notion, and its methodological applications, have been extensively investigated within the post-Popperian theories of verisimilitude which have emerged since 1975.5 Two basic issues dealt with by such theories are the following:

1. The analysis of the logical problem of verisimilitude, in order to provide appropriate notions of verisimilitude and distance from the truth. Such notions should allow us to compare any two theories w.r.t. their closeness to the truth, even in the case where they are false. Post-Popperian theories of verisimilitude generally admit that adequate notions of verisimilitude and distance from the truth should leave room for the possibility that, under certain conditions, a false theory is closer to the truth than a true one.

2. The analysis of the epistemic problem of verisimilitude, in order to define the notions of estimated verisimilitude and estimated distance from the truth. Such notions should allow us to compare, on the basis of the available data, any two theories w.r.t. their estimated closeness to the truth, even in the case where they have been falsified by such data. It is widely agreed that adequate notions of estimated verisimilitude and estimated distance from the truth should allow that, under certain conditions, the estimated verisimilitude of a falsified theory is higher than that of a non-falsified one.

The notion of verisimilitude may be used in the analysis of all three kinds of inductive inference defined in Section 1.2, i.e., plausibilistic, confirmational, and acceptance-based inferences. In fact one may define the plausibility of H in the light of E as the estimated verisimilitude of H in the light of E and may identify the degree of confirmation conveyed by E to H with an appropriate measure of the increase of the initial estimated verisimilitude of H determined by E. Finally, one may state verisimilitudinarian rules of inductive acceptance based on the idea that verisimilitude is the proper cognitive utility of science, so that, within a set of rival hypotheses, one should accept the hypothesis which maximizes the expected verisimilitude.
5 An excellent survey of the post-Popperian theories of verisimilitude is provided by Niiniluoto [1998].
3.2 Verisimilitudinarian Bayesian inductive logic and statistics
In the last twenty years the notions and principles of verisimilitudinarian BIL have been applied in the rational reconstruction of a number of widely used statistical procedures. Some of these applications are briefly illustrated below.

1. The choice of the optimum multinomial methods. Multinomial inferences might be defined as statistical inferences aiming at establishing the true value of the parameter vector q = (q1, ..., qk) of a multinomial process Ex. A major problem for the Bayesian analysis of multinomial inferences — which has been thoroughly investigated both in TIP and BS — is the choice of an optimum prior distribution F∗(q) on q. Festa [1993; 1995] considers this problem with reference to Dirichlet distributions (or, equivalently, GC-methods) and suggests a verisimilitude solution, inspired by the intuitive idea that the optimum Dirichlet distribution F∗(q) should be chosen so that the estimates of q1, ..., qk based on F∗(q) approach the truth most effectively. This means that a good reason for selecting F∗(q), within the class of Dirichlet distributions, is that there are grounds to believe that F∗(q) is the optimum tool for achieving a high degree of verisimilitude. The main points of Festa's verisimilitude solution can be informally stated as follows: (i) the efficiency of a Dirichlet distribution (GC-method) in approaching the truth depends on the degree of order in the sector of the universe under investigation; (ii) hence, the adoption of a given Dirichlet distribution (GC-method) as the optimum multinomial method — i.e., as the presumably most efficient tool to approach the truth — should be based on appropriate presuppositions about the degree of order in the investigated sector of the universe; (iii) such presuppositions are given by the scientists' initial estimates — grounded on their background knowledge — of the degree of order of the sector of the universe under examination within a given empirical inquiry.

2. Bayesian point and interval estimates. Given a distribution of epistemic probabilities on the relevant set of statements, one can calculate, on the basis of available evidence E, the expected verisimilitude and the expected distance from the truth of any hypothesis H under inquiry. One can construct a theory of inductive acceptance based on the idea that the distance from the truth is the cognitive loss whose expected value should be minimized when a hypothesis is selected for inductive acceptance. This idea has been applied to the rational reconstruction of certain standard results about point estimation obtained in Bayesian statistics. More specifically, Niiniluoto [1982a; 1987, pp. 27-29] shows that, given a suitable loss function, accepting a Bayesian point estimate is equivalent to minimizing the expected distance from the truth (a small numerical sketch of this idea is given at the end of this section). Furthermore, Niiniluoto [1982b; 1986; 1987, pp. 430-41] and others [Festa, 1986; Maher, 1993, pp. 143-47] extend this perspective to Bayesian interval estimation by approaching interval estimation in decision-theoretic terms, where the loss of accepting an interval is defined by the distance of this interval from the truth.

3. Bayesian analysis of observational errors. The expected verisimilitude of a hypothesis H can be seen as a reasonable estimate of the degree of verisimilitude of H, while the probable verisimilitude of H expresses our degree of belief in the possibility that H is highly verisimilar, i.e., that the verisimilitude of H exceeds a fixed threshold [Niiniluoto, 1987, Ch. 7.3]. The notions of expected and probable verisimilitude have been applied to the analysis of significant statistical problems such as the problem of observational errors. This problem typically arises in the cases where the evidence is obtained by measuring a given quantity: in fact, in such cases there is a positive probability that, due to errors in measurement, the result of measurement deviates from the true value of the quantity. After discussing in general terms the possibility of appraising verisimilitude on the basis of false evidence, Niiniluoto [1987, Ch. 7.4] suggests that the classical Bayesian treatment of observational errors can be understood as an attempt to evaluate the expected and probable verisimilitude of hypotheses on the basis of evidence which is known (with probability one) to be erroneous.

4. Problems of curve-fitting and regression analysis. In many cognitive problems — for instance, whenever the inquiry is based upon counterfactual idealizing conditions — all the relevant hypotheses under inquiry are known to be false. A well known example of this sort of cognitive problem is given by the statistical problems of curve-fitting, where all the relevant quantitative hypotheses, expressed by sufficiently simple linear functions, are known to be false. A verisimilitude interpretation of the typical statistical methods based on the least square difference between a curve and a finite set of data points is provided by Niiniluoto [1987, Ch. 7.5]. He suggests also that verisimilitude could be used for the rational reconstruction of the statistical methods of regression analysis which are applied to problems in which all the relevant hypotheses are false (idealizations) and the evidence is known to be erroneous [Niiniluoto, 1987, p. 289].

5. Approaching the statistical truth by qualitative theories. In the last three decades several measures of verisimilitude have been defined for different kinds of qualitative and quantitative theories, and their methodological applications have been thoroughly explored. Above we have seen, with reference to quantitative hypotheses, that the notion of verisimilitude can be fruitfully applied in the analysis of some important kinds of statistical inferences. However, recently it has been argued that qualitative theories too may have interesting statistical applications [Festa, 2007a; 2007b; 2008]. More precisely, it has been argued that qualitative theories stated in qualitative languages with two or more properties can be used in the description of the statistical structure of cross classified populations, and that their adequacy in this task can be evaluated by appropriate measures of their statistical verisimilitude. Among other things, Festa's notion of statistical verisimilitude has been used for the rational reconstruction of some aspects of the so-called prediction logic, a relatively recent approach to the statistical analysis of cross classified populations developed by the statisticians and social scientists David K. Hildebrand, James D. Laing, and Howard Rosenthal in a series of papers published between 1974 and 1977, which culminated in their not yet adequately appreciated book Prediction Analysis of Cross Classification (1976).

The above, necessarily incomplete, survey of some research areas on the border between inductive logic and statistics shows the abundance of conceptual relations between the two disciplines. It also suggests that inductive logicians and statisticians should engage in a serious effort to reduce the inessential differences in their languages and conceptual systems and to explore the numerous possibilities of interaction between — and cooperative development of — the two disciplines.6

6 For an excellent survey of the conceptual relations between statistics and inductive logic — and, more generally, between statistics and philosophy of science — see [Good, 1988].
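The numerical sketch promised in item 2 above follows. It is only an illustration under simple assumptions (a finite grid of candidate parameter values, squared distance as the measure of distance from the truth, hypothetical posterior weights): the point estimate minimizing expected distance from the truth is the familiar Bayesian one, sitting at (or, on a grid, nearest to) the posterior mean.

    theta = [0.0, 0.25, 0.5, 0.75, 1.0]          # candidate parameter values
    post  = [0.05, 0.20, 0.40, 0.25, 0.10]       # hypothetical posterior p(theta|E)

    def expected_sq_distance(estimate):
        # Expected squared distance from the (unknown) true value.
        return sum(p * (t - estimate) ** 2 for t, p in zip(theta, post))

    best = min(theta, key=expected_sq_distance)  # grid search over candidates
    posterior_mean = sum(t * p for t, p in zip(theta, post))
    print(best, posterior_mean)                  # 0.5 and 0.5375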
ACKNOWLEDGEMENTS

I would like to thank an anonymous referee for helpful comments on the first draft of this paper.

BIBLIOGRAPHY

[Albert and Gupta, 1983] J. H. Albert and A. K. Gupta. Bayesian Estimation Models for 2 × 2 Contingency Tables Using Mixtures of Dirichlet Distributions. Journal of the American Statistical Association 78, 708-717, 1983.
[Blackwell and MacQueen, 1973] D. Blackwell and J. B. MacQueen. Ferguson Distributions via Polya Urn Schemes. Annals of Statistics 1, 353-355, 1973.
[Carnap, 1963] R. Carnap. Variety, Analogy and Periodicity in Inductive Logic. Philosophy of Science 30, 222-227, 1963.
[Carnap, 1980] R. Carnap. A Basic System of Inductive Logic, Part II. In R. Jeffrey (ed.), Studies in Inductive Logic and Probability, vol. II. Berkeley: University of California Press, pp. 7-155, 1980.
[Carnap and Stegmüller, 1959] R. Carnap and W. Stegmüller. Induktive Logik und Wahrscheinlichkeit. Wien: Springer-Verlag, 1959.
[Connor and Mosimann, 1969] R. J. Connor and J. E. Mosimann. Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution. JASA 64, 194-206, 1969.
[Costantini, 1983] D. Costantini. Analogy by Similarity. Erkenntnis 20, 103-114, 1983.
[Costantini and Garibaldi, 1996] D. Costantini and U. Garibaldi. Predictive Laws of Association in Statistics and Physics. Erkenntnis 45, 259-261, 1996.
[de Finetti, 1937/1964] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré 7, 1-68, 1937. English translation in H. E. Kyburg Jr. and H. Smokler (eds.), Studies in Subjective Probability. New York: Wiley, pp. 93-158, 1964.
[de Finetti, 1974] B. de Finetti. Probability, Induction and Statistics. New York: Wiley, 1974.
[Epstein and Feinberg, 1992] L. D. Epstein and S. E. Feinberg. Bayesian Estimation in Multidimensional Contingency Tables. In P. K. Goel and N. S. Iyengar (eds.), Bayesian Analysis in Statistics and Econometrics. New York: Springer, pp. 27-41, 1992.
[Fabius, 1973] J. Fabius. Two Characterizations of the Dirichlet Distributions. Annals of Statistics 1, 583-587, 1973.
[Ferguson, 1973] T. Ferguson. A Bayesian Analysis of Some Non-Parametric Problems. Annals of Statistics 1, 209-230, 1973.
[Festa, 1986] R. Festa. A Measure for the Distance Between an Interval Hypothesis and the Truth. Synthese 67, 273-320, 1986.
[Festa, 1993] R. Festa. Optimum Inductive Methods. A Study in Inductive Probability, Bayesian Statistics, and Verisimilitude. Dordrecht: Kluwer, 1993.
[Festa, 1995] R. Festa. Verisimilitude, Disorder, and Optimum Prior Probabilities. In T. K. Kuipers and A. R. Mackor (eds.), Cognitive Patterns in Science and Common Sense. Amsterdam: Rodopi, pp. 299-320, 1995.
[Festa, 1996] R. Festa. Analogy and Exchangeability in Predictive Inferences. Erkenntnis 45, 229-252, 1996.
[Festa, 1999a] R. Festa. Bayesian Confirmation. In M. C. Galavotti and A. Pagnini (eds.), Experience, Reality, and Scientific Explanation. Dordrecht: Kluwer, pp. 55-87, 1999.
[Festa, 1999b] R. Festa. Scientific Values, Probability, and Acceptance. In R. Rossini Favretti, G. Sandri and R. Scazzieri (eds.), Incommensurability and Translation. Cheltenham: Edward Elgar, pp. 323-338, 1999.
[Festa, 2003] R. Festa. Induction, Probability, and Bayesian Epistemology. In L. Haaparanta and I. Niiniluoto (eds.), Analytic Philosophy in Finland. Amsterdam: Rodopi, pp. 251-284, 2003.
[Festa, 2007a] R. Festa. Verisimilitude, Cross Classification, and Prediction Logic. Approaching the Statistical Truth by Falsified Qualitative Theories. Mind and Society 6, 37-62, 2007.
[Festa, 2007b] R. Festa. The Qualitative and Statistical Verisimilitude of Qualitative Theories. La Nuova Critica 47-48, 91-114, 2007.
[Festa, 2008] R. Festa. Verisimilitude, Qualitative Theories, and Statistical Inferences. In M. Sintonen, S. Pihlström and P. Raatikainen (eds.), Approaching Truth: Essays in Honour of Ilkka Niiniluoto. London: College Publications, pp. 143-177, 2008.
[Festa, 2009] R. Festa. 'For Unto Every One That Hath Shall Be Given': Matthew Properties for Incremental Confirmation. Synthese, DOI 10.1007/s11229-009-9695-5, forthcoming.
[Festa et al., 2010] R. Festa, V. Crupi, and C. Buttasi. Towards a Grammar of Bayesian Confirmation. In M. Suárez, M. Dorato, and M. Rédei (eds.), EPSA Epistemology and Methodology of Science, vol. 1. Dordrecht: Springer, pp. 73-93, 2010.
[Fisher, 1925] R. A. Fisher. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd, 1925.
[Fisher, 1935] R. A. Fisher. The Design of Experiments. Edinburgh: Oliver & Boyd, 1935.
[Fisher, 1950] R. A. Fisher. Contributions to Mathematical Statistics. New York: Wiley, 1950.
[Fisher, 1956] R. A. Fisher. Statistical Methods and Scientific Inference. Edinburgh: Oliver & Boyd, 1956.
[Fitelson, 1999] B. Fitelson. The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity. Philosophy of Science 66, S362-S378, 1999.
[Good, 1950] I. J. Good. Probability and the Weighing of Evidence. London: Charles Griffin and Co., 1950.
[Good, 1960] I. J. Good. The Paradox of Confirmation. BJPS 11, 145-149, 1960.
[Good, 1960/1968] I. J. Good. Weight of Evidence, Corroboration, Explanatory Power, Information, and the Utility of Experiments. JRSS B 22, 319-331, 1960; Corrigenda 30, 203, 1968.
[Good, 1961] I. J. Good. The Paradox of Confirmation (II). BJPS 12, 63-64, 1961.
[Good, 1965] I. J. Good. The Estimation of Probabilities. Cambridge, Mass.: The MIT Press, 1965.
[Good, 1968] I. J. Good. Corroboration, Explanation, Evolving Probability, Simplicity, and a Sharpened Razor. BJPS 19, 123-143, 1968.
[Good, 1975] I. J. Good. Explicativity, Corroboration, and the Relative Odds of Hypotheses. Synthese 30, 39-73 and 83-93, 1975.
[Good, 1980] I. J. Good. Another Relationship Between Weight of Evidence and Errors of the First and Second Kinds. JSCS 10, 315-316, 1980.
[Good, 1981a] I. J. Good. The Weight of Evidence Provided by Uncertain Testimony or from an Uncertain Event. JSCS 13, 56-60, 1981.
[Good, 1981b] I. J. Good. An Error by Peirce Concerning Weight of Evidence. JSCS 13, 155-157, 1981.
[Good, 1982] I. J. Good. A Good Explanation of an Event Is Not Necessarily Corroborated by the Event. Philosophy of Science 49, 251-253, 1982.
[Good, 1983] I. J. Good. Good Thinking. Minneapolis: The University of Minnesota Press, 1983.
[Good, 1984] I. J. Good. The Best Explication for Weight of Evidence. JSCS 19, 294-299, 1984.
[Good, 1985a] I. J. Good. Weight of Evidence: A Brief Survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (eds.), Bayesian Statistics 2. New York: North Holland, pp. 249-269, 1985.
[Good, 1985b] I. J. Good. Sufficiency and Weight of Evidence. JSCS 21, 334-336, 1985.
[Good, 1988] I. J. Good. The Interface Between Statistics and Philosophy of Science. Statistical Science 3, 386-397, 1988.
[Good, 1989a] I. J. Good. Yet Another Argument for the Explication of Weight of Evidence. JSCS 31, 58-59, 1989.
[Good, 1989b] I. J. Good. Weight of Evidence and a Compelling Metaprinciple. JSCS 31, 121-123, 1989.
[Good, 1989c] I. J. Good. The Theorem of Corroboration and Undermining, and Popper's Demarcation Rule. JSCS 31, 119-120, 1989.
[Good, 1991] I. J. Good. Weight of Evidence and the Bayesian Likelihood Ratio. In C. G. G. Aitken and D. A. Stoney (eds.), The Use of Statistics in Forensic Science. Boca Raton, Florida: CRC Press, pp. 84-106, 1991.
[Good, 1997] I. J. Good. Bayes Factors, Weights of Evidence and the Law. Review of J. B. Kadane and D. A. Schum, A Probabilistic Analysis of the Sacco and Vanzetti Evidence, New York: Wiley, 1996. JSPI 64, 171-191, 1997.
[Good and McMichael, 1984] I. J. Good and A. F. McMichael. A Pragmatic Modification of Explicativity for the Acceptance of Hypotheses. Philosophy of Science 51, 120-127, 1984.
[Hesse, 1964] M. Hesse. Analogy and Confirmation Theory. Philosophy of Science 31, 319-327, 1964.
[Hildebrand et al., 1976] D. K. Hildebrand, J. D. Laing, and H. Rosenthal. Prediction Analysis of Cross Classification. New York: Wiley, 1976.
[Jeffreys, 1931] H. Jeffreys. Scientific Inference. Cambridge: Cambridge University Press, 1931.
[Jeffreys, 1939] H. Jeffreys. Theory of Probability. Oxford: Oxford University Press, 1939.
[Kuipers, 1978] T. A. F. Kuipers. Studies in Inductive Probability and Rational Expectation. Dordrecht: Reidel, 1978.
[Kuipers, 1984a] T. A. F. Kuipers. Two Types of Inductive Analogy by Similarity. Erkenntnis 21, 63-87, 1984.
[Kuipers, 1984b] T. A. F. Kuipers. Inductive Analogy in Carnapian Spirit. In P. Asquith and P. Kitcher (eds.), PSA 1984, Vol. I. East Lansing, Michigan: Philosophy of Science Association, pp. 157-167, 1984.
[Kuipers, 1988] T. A. F. Kuipers. Inductive Analogy by Similarity and Proximity. In D. H. Helman (ed.), Analogical Reasoning. Dordrecht: Kluwer, pp. 299-313, 1988.
[Levi, 1967] I. Levi. Gambling with Truth. New York: Alfred A. Knopf, 1967.
[Levi, 1980] I. Levi. The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance. Cambridge, Mass.: The MIT Press, 1980.
[Lindley, 1964] D. V. Lindley. The Bayesian Analysis of Contingency Tables. Annals of Mathematical Statistics 35, 1622-1643, 1964.
[Maher, 1993] P. Maher. Betting on Theories. Cambridge: Cambridge University Press, 1993.
[Maher, 2000] P. Maher. Probabilities for Two Properties. Erkenntnis 52, 63-91, 2000.
[Maher, 2001] P. Maher. Probabilities for Multiple Properties: The Models of Hesse and Carnap and Kemeny. Erkenntnis 55, 183-216, 2001.
[Niiniluoto, 1980] I. Niiniluoto. Analogy, Transitivity, and the Confirmation of Theories. In L. J. Cohen and M. Hesse (eds.), Applications of Inductive Logic. Oxford: Oxford University Press, pp. 218-234, 1980.
[Niiniluoto, 1981] I. Niiniluoto. Analogy and Inductive Logic. Erkenntnis 16, 1-34, 1981.
[Niiniluoto, 1988] I. Niiniluoto. Analogy and Similarity in Scientific Reasoning. In D. H. Helman (ed.), Analogical Reasoning. Dordrecht: Kluwer, pp. 271-298, 1988.
[Niiniluoto, 1982a] I. Niiniluoto. What Shall We Do with Verisimilitude? Philosophy of Science 49, 181-197, 1982.
[Niiniluoto, 1982b] I. Niiniluoto. Truthlikeness for Quantitative Statements. In P. D. Asquith and T. Nickles (eds.), PSA 1982, Vol. 1. East Lansing, Michigan: Philosophy of Science Association, pp. 208-216, 1982.
[Niiniluoto, 1986] I. Niiniluoto. Truthlikeness and Bayesian Estimation. Synthese 67, 321-346, 1986.
[Niiniluoto, 1987] I. Niiniluoto. Truthlikeness. Dordrecht: Reidel, 1987.
[Niiniluoto, 1998] I. Niiniluoto. Verisimilitude: The Third Period. BJPS 49, 11-29, 1998.
[Popper, 1963] K. R. Popper. Conjectures and Refutations. London: Routledge and Kegan Paul, 1963.
[Romeyn, 2005] J. W. Romeyn. Bayesian Inductive Logic. PhD dissertation, University of Groningen, 2005.
[Romeyn, 2006] J. W. Romeyn. Analogical Predictions for Explicit Similarity. Erkenntnis 64, 253-280, 2006.
[Savage, 1954] L. J. Savage. The Foundations of Statistics. New York: Wiley, 1954.
[Skyrms, 1991] B. Skyrms. Carnapian Inductive Logic for Markov Chains. Erkenntnis 35, 439-460, 1991.
[Skyrms, 1993a] B. Skyrms. Analogy by Similarity in HyperCarnapian Inductive Logic. In J. Earman (ed.), Philosophical Problems of the Internal and External Worlds. Essays in the Philosophy of Adolf Grünbaum. Pittsburgh: University of Pittsburgh Press, pp. 273-282, 1993.
[Skyrms, 1993b] B. Skyrms. Carnapian Inductive Logic for a Value Continuum. In H. Wettstein (ed.), The Philosophy of Science. Midwest Studies in Philosophy, vol. 18. South Bend, Indiana: University of Notre Dame Press, pp. 78-89, 1993.
[Skyrms, 1996] B. Skyrms. Carnapian Inductive Logic and Bayesian Statistics. In Statistics, Probability and Game Theory, IMS Lecture Notes - Monograph Series, Volume 30, pp. 321-336, 1996.
Likelihood Paradigm
LIKELIHOOD AND ITS EVIDENTIAL FRAMEWORK
Jeffrey D. Blume
1 INTRODUCTION
Statistics is the discipline responsible for the interpretation of data as scientific evidence. Not surprisingly, there is a broad statistical literature dealing with the interpretation of data as statistical evidence, the foundations of statistical inference, and the various statistical paradigms for measuring statistical evidence. However, this literature is surprisingly diverse; the range of viewpoints, opinions and recommendations regarding methods for measuring statistical evidence is as varied as it is vast. This diversity is due, in part, to the complex philosophical nature of the problem. But it is also due to the absence of a generally accepted framework for characterizing and evaluating paradigms that purport to measure statistical evidence. This paper proffers a general framework that enables the comparison and the evaluation of statistical paradigms claiming to measure the strength of statistical evidence in data. The framework is simple and general, consisting of only three key components (i.e., three key definitions). These essential components may, at first, appear obvious. For example, the first component is nothing more than a definition of the mathematical quantity used to measure the strength of evidence in data.1 Unfortunately, the first component is not always obvious and its (critical) definition is sometimes missing.2 Once defined, however, the behavior of competing measures can be detailed and contrasted. As we will see, more than just a measure of evidence is needed to understand and evaluate a paradigm for measuring statistical evidence. The absence of a well defined framework can lead to controversies like those surrounding ad-hoc adjustments to p-values for multiple looks3 or for multiple comparisons.4

1 For example, in classical frequentist inference this measure would be the p-value.
2 For an example consider Bayesian inference. Is it the Bayes factor or the posterior probability that measures the strength of evidence in data?
3 Reexamining accumulating data during the course of a study.
4 Evaluating several different scientific endpoints (e.g., overall survival, cause-specific survival, safety) in a single study.
1.1 The three evidential quantities
Three essential quantities for assessing and interpreting the strength of statistical evidence in data are:

1. the measure of the strength of evidence,
2. the probability that a particular study design will generate misleading evidence,5
3. the probability that observed evidence is misleading.

For brevity I will sometimes denote these evidential quantities as EQ1, EQ2, and EQ3.6 The first and the third quantities inform the statistical evaluation of data as scientific evidence. The second quantity informs the data collection process. All three quantities are essential to science and to statistics.7

Each evidential quantity represents the answer to a critical scientific question. The first provides an answer to the question, "How strong is the evidence in these data?" The second answers the question, "What is the chance that my study will yield data that are misleading?" The third answers the question, "What is the chance that these observed data are misleading?" The first and third quantities depend on, and pertain to, the observed data; they reflect characteristics of an existing set of data. The second quantity depends on, and pertains to, the study design; it does not inform the interpretation of data because it presupposes that data have not yet been collected. Each quantity provides unique information regarding the interpretation (EQ1), collection (EQ2) and reliability (EQ3) of statistical evidence.

In this paper, a statistical paradigm is said to have a well defined 'evidential framework' if that paradigm clearly defines, and distinguishes between, these three quantities. Once established, evidential frameworks are evaluated and contrasted by examining the statistical performance of these quantities in various scenarios.8

5 Misleading as measured by EQ1. It helps to think of misleading evidence as strong evidence in support of a false hypothesis. This allows for the possibility of weak or inconclusive evidence in support of a false hypothesis, which is typically not considered misleading in scientific applications. I will be more precise in upcoming illustrations.
6 I assert the importance of these quantities based on my experience and their immediate obviousness. Others have alluded to the same thing. For example, see Royall's [1997] preface, [Blume, 2002; 2007], and [Strug and Hodge, 2006a; 2006b].
7 The natural temporal order in which these quantities are considered during scientific research is: EQ2, EQ1, EQ3.
8 For example, the rate at which an EQ1 converges to the identification of the correct hypothesis might be a comparison metric between frameworks. The framework with the quicker rate of convergence would probably require less data and therefore be preferable.
1.2 An analogy for the second and third evidential quantities
The second (EQ2) and third (EQ3) evidential quantities are routinely mistaken for one another. Once a set of data is actually collected, the probability of collecting some other set of data that turns out to be misleading (i.e., EQ2) is irrelevant. The observed data are either misleading or not, and we know not which. What is of interest is the potential for the observed data to be misleading (i.e., EQ3). After observing data, it makes perfect sense to ask "What is the probability that the data I just collected are misleading?" EQ2 characterizes the chance that the study will yield a misleading result. EQ3 characterizes the chance that the observed result is misleading.

A simple example may help illuminate the distinction. Jena and Jamie each play the lottery.9 Jena buys one ticket. Jamie buys ten tickets, but one of the tickets is identical to Jena's ticket. If Jena wins, so does Jamie. But Jamie can also win with one of her other nine tickets. Therefore, Jamie's chance of winning is ten times that of Jena's, but still very small at 10 out of 195,249,054. Here the probability of winning is analogous to EQ2 (e.g., "What is the chance of winning with these tickets?").10 Both Jena and Jamie have different chances of winning due to their different game playing strategies.

The next morning Jena and Jamie look in the newspaper for the winning lottery numbers. Unfortunately, one of the numbers is smudged and unreadable.11 The remaining numbers, however, match Jena's ticket. And because Jamie also has an identical ticket, both women may have a winning ticket. At this point, Jamie's and Jena's chance of winning is identical and is equal to 2.5%.12 Here the (updated) probability of winning is analogous to EQ3 (e.g., "What is the chance of winning after matching all but one number?"). The fact that Jamie had a ten-fold chance to win the lottery yesterday is completely irrelevant. Her ticket is just as valid as Jena's, she stands to win just as much money, and she has the exact same chance of doing so. The insight is this: Once data are collected, EQ2 becomes irrelevant. What 'might have been' becomes irrelevant. It is EQ3, what 'might be', that is the relevant quantity once data are collected.

9 By lottery I mean Powerball. To win the jackpot, players must correctly match 5 white balls (numbered 1 through 59) and one red ball (numbered 1 to 39) that are drawn randomly without replacement. The order in which the white balls are drawn does not matter. The odds of winning the jackpot are 1 in 195,249,054.
10 Here winning the lottery is akin to collecting misleading evidence and the tickets are akin to the study design.
11 I assume that the red ball number is unreadable, but the details are ancillary to this analogy.
12 Because only the red ball is in question, the chance of winning is now 1 in 39 or approximately 2.5%.
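The arithmetic behind the analogy is easy to verify; the following lines (an illustrative check, using only the Powerball rules stated in the footnote) reproduce the numbers quoted above.

    from math import comb

    jackpot_odds = comb(59, 5) * 39     # 5 white balls from 59, 1 red ball from 39
    print(jackpot_odds)                 # 195249054

    # EQ2 analogues: each woman's chance of winning before the draw.
    print(1 / jackpot_odds)             # Jena, one ticket
    print(10 / jackpot_odds)            # Jamie, ten tickets

    # EQ3 analogue: after all but the red ball match, both have the same chance.
    print(1 / 39)                       # about 0.025, i.e., 2.5%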
1.3 Absence of an evidential framework
The absence of a well defined evidential framework can lead to irresolvable controversies, such as those surrounding the proper use and interpretation of p-values or those concerning adjustments to p-values for multiple comparisons and multiple looks at data [Blume, 2003; Goodman and Royall, 1988; Goodman, 1998; Royall, 1986; 1997]. I will revisit this later, but it suffices to note that the lack of clarity on which evidential quantity the p-value represents causes considerable confusion.13

13 The p-value is often interpreted as each of the three quantities, sometimes at the same time.
Bayesians are not immune to this criticism either, as it remains unclear which of the evidential quantities a posterior probability14 is intended to represent. The advantages of a well defined evidential framework will be illustrated by examining the Likelihood paradigm in the context of multiple examinations of data and multiple comparisons. Among the three15 prominent statistical paradigms for measuring statistical evidence, only the Likelihood paradigm has a well developed evidential framework. In fact, the decoupling of the three evidential quantities allows Likelihood to obey the Likelihood principle16 and retain good frequentist properties without having to use ad-hoc adjustments or prior probabilities. This is because its EQ1 may adhere to the Likelihood principle without forcing an adjustment to EQ2, which naturally depends on the study design. Within a well defined evidential framework it is not a contradiction for two studies to yield identical data, and therefore equivalent statistical evidence, when each study initially had a different chance of generating misleading evidence (due to differing study designs). Moreover, we will see that neither set of observed data is more likely to be misleading than the other.17 In terms of evidential quantities, EQ2 may differ between studies, but those studies may generate identical data such that EQ1 and EQ3 are themselves identical (e.g., see the above lottery example). The often noted controversial case of this is when two different studies, each with a different stopping rule, yield the exact same data.18

14 A posterior probability is the probability of the hypothesis given the observed data. This uses the prior probability of the hypothesis, which is set before the data were observed.
15 The three paradigms are Bayesian, Likelihood and frequentist.
16 The likelihood principle states that two studies yielding the same data and using the same probabilistic model must have equivalent measurements of the strength of the evidence in those data. P-values violate the likelihood principle.
17 It may seem obvious to assert that identical sets of data have exactly the same propensity to be misleading. The lottery example would certainly support this point of view. However, just the opposite is the current dogma in statistics. It is commonly believed that two identical sets of data generally have different propensities to be misleading when the study designs from which they came differ. This dogma is a consequence of the failure to distinguish between EQ2 and EQ3.
18 The controversy arises because the classical approach does not yield the same strength of evidence in each study (despite the fact that they generated identical datasets). This is because the p-value is adjusted differently based on the study design (i.e., stopping rule). The validity of this approach has been debated for decades.
2 THE LIKELIHOOD PARADIGM
The Likelihood paradigm is based on the Law of Likelihood, which explains when the data represent statistical evidence for one hypothesis over another. Simply put, the data better support the hypothesis that does a better job of predicting the observed events, and the likelihood ratio measures the degree to which one hypothesis is better supported over the other [Hacking, 1965; Edwards, 1971; Royall, 1997]. Likelihood ratios are non-negative. Suggested benchmarks of 8 and 32 signify a transition from a weak to moderate level of evidence and from a moderate to strong level of evidence, respectively. Introductory material on this approach and extensions are readily available (e.g., [Blume, 2002; 2005; 2007; 2008; Goodman and Royall, 1988; Strug and Hodge, 2006a; 2006b; Royall and Tsou, 2003; Tsou and Royall, 1995; Van der Tweel, 2005]).

The Law of Likelihood provides a measure of the strength of evidence between two hypotheses.19 In contrast, the Likelihood principle sets forth the conditions under which two experiments yield equivalent statistical evidence for two hypotheses of interest. This condition is met when their likelihood functions are proportional [Birnbaum, 1962; Barnard, 1949]. Acceptance of the law implies acceptance of the principle: if all the likelihood ratios between the two experiments are identical, then their likelihood functions must be proportional [Royall, 2000].
2.1 Illustration of the three evidential quantities
The most accessible illustration comes from Royall's [1997] diagnostic test example. Suppose we use a diagnostic test to generate evidence about the disease state of an individual. The properties of this test are given in Table 1. Its sensitivity is 0.94 = P(T+|D+) and specificity is 0.98 = P(T−|D−).20 According to the law of likelihood, a positive test result would be evidence supporting H+ (disease is present) over H− (disease is absent) because the likelihood ratio (LR) would be 47 [= 0.94/0.02 = P(T+|D+)/P(T+|D−)]. Likewise, a negative result would represent evidence supporting H− over H+ because the likelihood ratio would be 16.3 [= 0.98/0.06 = P(T−|D−)/P(T−|D+)]. The likelihood ratio is EQ1; it measures the degree to which the data support one hypothesis over another.

Table 1. Properties of the diagnostic test

                              Test Result (T)
    Disease Status (D)    Positive (+)    Negative (−)
    Yes (+)                   0.94            0.06
    No (−)                    0.02            0.98
However, this test may generate evidence that is misleading. A positive result is correctly interpreted as evidence for H+ over H−, but positive results can occur when the disease is absent. This happens only 2% of the time in people without the disease, but when it does happen the test is said to have generated misleading evidence.21 Of course, the test can also generate misleading negative results, and this happens 6% of the time in people who have the disease. These two probabilities are second evidential quantities (EQ2); they are the probabilities of observing misleading evidence under this study design. They are analogous to the error rates of hypothesis testing and they play an important role in defining the quality of the diagnostic test and the data collection process. A good diagnostic test maximizes sensitivity and specificity, which here is the same as minimizing the second evidential quantities (i.e., minimizing the potential to observe misleading positive and negative tests).

Upon observing a certain test result, the strength of the evidence will be clear from the likelihood ratio (e.g., a positive test yields an LR of 47). We will never know if the observed test result is misleading or not. However, it is sometimes possible to determine the propensity for the observed test result to be misleading.22 For example, an observed positive result is misleading if and only if the subject does not have the disease. The probability that the subject does not have the disease is P(D−|T+). By Bayes theorem, this is P(D−|T+) = 1/[1 + 47π+/π−], where π+ = P(H+) and π− = P(H−) = 1 − π+ are the prior probabilities of the individual's disease state. By identical reasoning, an observed negative result is misleading if and only if the subject has the disease. Here that probability is P(D+|T−) = 1/[1 + 16.3π−/π+]. So we see that the probability that the observed evidence is misleading, EQ3, is nothing more than the posterior probability.

The obvious (non-computational) difficulty with EQ3 is the specification of the prior probabilities, which are inherently subjective.23 However, diagnostic tests are a rare example in which there exists a broad consensus regarding the prior probabilities. Typically, the disease prevalence is used to set the prior when it is reasonable to assume that the subject was selected at random from the population. Suppose the disease prevalence is π+ = 0.015. Then P(D−|T+) = 0.583 and P(D+|T−) = 0.0009. This means that positive results are not as reliable as negative results. In fact, observed positive results are misleading more than half the time in this population. This does not mean we are wrong when we interpret a positive result as evidence that the disease is present.24 It just means that in this population the strength of evidence provided by a positive result is not strong enough to outweigh our (very strong) prior knowledge about the presence or absence of disease. Before the test is given, the probability that the individual is diseased is only 1.5%. This probability increases to 41.7% after observing a positive test result.25 However, it remains more likely (58.3%) that the individual does not have the disease. Because EQ3 depends on prior probabilities, it remains context based. This is why strong evidence in one study may not be strong enough evidence to provide the same sense of reliability in another study.26

From the expressions for EQ3, we see that larger likelihood ratios are more reliable in the sense that they are less likely to be misleading.27 EQ1 and EQ3 have an inverse relationship; as the evidence gets stronger, its potential to be misleading decreases. Lastly, it is worth reiterating that the likelihood principle indicates that EQ2 is irrelevant once the data have been collected. At that point, EQ1 and EQ3 are the only quantities of interest.

19 This representation of evidence is essentially comparative between simple hypotheses. Simple hypotheses specify a single probability distribution. Composite hypotheses specify a family of distributions.
20 The vertical bar '|' is read as 'given'. This is a conditioning argument, as events after the bar are considered fixed.
21 Statistical tests (i.e., tests that are not deterministic) must be able to generate misleading evidence. If not, then only a single observation would be needed to correctly identify the true hypothesis.
22 We need only be willing to make certain assumptions about the prior probability of the hypotheses or disease state.
23 Admittedly there are varying degrees of subjectivity, as priors come from many different sources.
24 Surely a positive result from this test is not evidence the disease is absent!
25 This large increase is due to having a large likelihood ratio — strong evidence — of 47.
26 Context affects the prior probabilities that are used in the calculation of EQ3.
27 Here the prior probabilities are considered fixed.
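A short computational sketch (illustrative only; it simply re-derives the numbers in the example above) makes the three quantities explicit for this diagnostic test.

    sens, spec = 0.94, 0.98            # P(T+|D+) and P(T-|D-) from Table 1

    # EQ1: likelihood ratios for positive and negative results.
    lr_pos = sens / (1 - spec)         # 47, supporting H+ over H-
    lr_neg = spec / (1 - sens)         # about 16.3, supporting H- over H+

    # EQ2: probabilities that the test generates misleading evidence.
    mis_pos = 1 - spec                 # misleading positive result: 2%
    mis_neg = 1 - sens                 # misleading negative result: 6%

    # EQ3: probability the observed result is misleading (a posterior probability).
    prior = 0.015                      # disease prevalence used as the prior
    p_mislead_pos = 1 / (1 + lr_pos * prior / (1 - prior))   # P(D-|T+), about 0.583
    p_mislead_neg = 1 / (1 + lr_neg * (1 - prior) / prior)   # P(D+|T-), about 0.0009
    print(lr_pos, lr_neg, mis_pos, mis_neg, p_mislead_pos, p_mislead_neg)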
2.2 Hypothesis testing and significance testing
The cornerstone of mainstream statistical methods is an ad-hoc union of hypothesis testing and significance testing [Blume, 2003; Goodman, 1998]. Experiments are designed under the hypothesis testing framework and analyzed under a significance testing one. Because this union was unplanned, the tail area probability28 and what it represents is a source of continuing confusion. In the hypothesis testing framework the tail area probability represents EQ2 (i.e., the Type I error), but in the significance testing framework it represents EQ1 (i.e., the p-value). Moreover, there is no EQ1 in the hypothesis testing framework and there is no EQ2 in the significance testing framework. So it seems reasonable (and even quite natural) to many non-statisticians to merge the two approaches.29 Scientists will always look for each of the three evidential quantities; they represent important concepts and they have distinct roles in the scientific process.

To impress the point, let's consider a hypothesis test and a significance test in the diagnostic testing example from the previous section. A hypothesis test sets the null hypothesis as H0: disease absent and the alternative as H1: disease present. We 'reject the null hypothesis' when a positive result is observed and 'fail to reject H0' when a negative result is observed. The test would have a Type I error rate of 2% and a Type II error rate of 6%.30 According to various decision theoretic standards this is a good test. But problems arise if we try to interpret the results of the test as statistical evidence. First, failure to reject the null hypothesis cannot be taken as evidence for the null hypothesis. In fact, a negative result (i.e., failure to reject the null hypothesis) can never be interpreted as evidence that the disease is absent31; instead it is always interpreted as statistically inconclusive. Second, there is no strength of evidence to report (i.e., no EQ1). The best we can do is report our decision to 'reject' or 'fail to reject', along with the error rates of our decision rule. But this is largely unsatisfactory from a scientific viewpoint, especially if the desire is to report the strength of evidence for the hypotheses of interest (e.g., weak, moderate or strong).

This is why significance testing is employed at the end of the study. Significance testing involves calculating and reporting the p-value as a measure of the strength of evidence against the null hypothesis. In this case, the p-value is 2%, which would be considered strong evidence against the null because it is less than the common benchmark of 5%. Here too, it is not possible to generate evidence in favor of the null hypothesis; large p-values are interpreted as being inconclusive. For example, a negative test result, which yields a p-value of one32, cannot be interpreted as evidence that the disease is absent. So we see from this example that while significance testing provides an EQ1 (i.e., the p-value), it does not provide an EQ2 or EQ3. EQ2, which would be calculated before the study is conducted, might be the probability of observing a small p-value when the null hypothesis is true (but EQ2 is not identically the p-value itself).33

Consider also a Bayesian approach that focuses on the posterior probabilities. In this case, after observing a positive result, the posterior probability of disease, P(D+|T+), is only 0.417. It remains unclear how that positive result ought to be interpreted if the posterior probability is used as the EQ1. This is because even after a positive test result the subject is more likely to be disease free (i.e., 41.7% < 50%). So should the positive result be considered evidence that the disease is absent? If yes, then this test can never generate statistical evidence that the disease is present (because the posterior probability of disease after a negative test result, P(D+|T−) = 0.0009, is very small). If no, then by what scale and context are we to interpret the posterior probability?34 EQ1 needs to be explicitly defined and we need to be told how to use it. Also, does it make sense to define EQ2 in a way that is dependent on the prior probabilities?35 So here too, the lack of an evidential framework leaves the approach so open ended that its utility is not clear.

28 Tail area probability is a probabilistic term representing the core calculation in a p-value and in a Type I error.
29 Fisher was well aware of this and warned: "In fact, as a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence" [Fisher, 1959, p. 93].
30 A Type I error rate is the probability of rejecting a true null hypothesis and a Type II error rate is the probability of failing to reject a false null hypothesis. The Type II error rate is calculated under a pre-specified simple alternative hypothesis.
31 Absence of evidence is not evidence of absence.
32 The p-value is the probability of observing the data or data more extreme. Thus, the p-value is the probability of observing a negative or positive test result when the null hypothesis is true. This probability is one.
33 The corresponding EQ3 would be the probability that the disease is absent given p = 0.02.
34 If the answer is "They should be interpreted in relation to how much they have changed from the prior probability" then this is nothing more than the likelihood paradigm. See [Royall, 1997].
35 One example is "What is the probability that the posterior will be as large as 0.95 under various hypotheses?" The subtle caveat here is that we are providing long run frequency properties that are dependent on the priors and this may or may not rely on the 'validity' of the specification of the priors.
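The contrast can be seen in a few lines of arithmetic (an illustrative re-computation of the figures quoted above, under the same diagnostic-test setup).

    sens, spec, prior = 0.94, 0.98, 0.015

    # Significance testing: the p-value for a positive result is the tail area
    # under H0; for a negative result the tail area covers the whole sample space.
    p_value_pos = 1 - spec                         # 0.02, 'significant' at 5%
    p_value_neg = 1.0                              # inconclusive by construction

    # Bayesian posterior probability of disease after a positive result.
    post_pos = sens * prior / (sens * prior + (1 - spec) * (1 - prior))
    print(p_value_pos, p_value_neg, post_pos)      # 0.02, 1.0, about 0.417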
2.3 Background and notation

In order to consider more complex situations it is helpful to establish some notation, if only to help the reader distinguish between the three evidential quantities.
Every effort is made to skip unnecessary mathematical complexities in the examples that follow. Suppose observations X1, ..., Xn are independent and identically distributed according to some density f(Xi; θ). For a fixed sample size of n, let the likelihood function be Ln(θ) ∝ Πi f(Xi; θ). There are two hypotheses of interest: the null, H0: θ = θ0, and the alternative, H1: θ = θ1. According to the Law of Likelihood, the strength of the evidence for H1 over H0 is measured by LR = Ln(θ1)/Ln(θ0). An observed likelihood ratio will fall into one of three regions: LR ∈ [0, 1/k] indicating strong evidence for H0 over H1, LR ∈ (1/k, k) indicating weak evidence in either direction, and LR ∈ [k, ∞) indicating strong evidence for H1 over H0, for some k ≥ 1. By convention, we use k = 8 or 32 to denote the transition from weak to moderate and from moderate to strong evidence.36

36 A likelihood ratio of 4 would be weak evidence in favor of H1 and a likelihood ratio (LR) of 1/47 would be strong evidence in favor of H0. The weak evidence region — LRs between 1/8 and 8 — is important to future discussions.
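For a concrete instance of this notation, the sketch below (illustrative; the data and parameter values are hypothetical) computes LR = Ln(θ1)/Ln(θ0) for a normal mean with known variance and classifies the result into the three evidence regions.

    import math

    def likelihood_ratio(xs, theta0, theta1, sigma=1.0):
        # Work on the log scale: the product of density ratios becomes a sum.
        loglr = sum(((x - theta0) ** 2 - (x - theta1) ** 2) / (2 * sigma ** 2)
                    for x in xs)
        return math.exp(loglr)

    def classify(lr, k=8):
        if lr >= k:
            return "strong evidence for H1 over H0"
        if lr <= 1 / k:
            return "strong evidence for H0 over H1"
        return "weak evidence in either direction"

    xs = [0.9, 1.4, 0.2, 1.1]                        # hypothetical observations
    lr = likelihood_ratio(xs, theta0=0.0, theta1=1.0)
    print(round(lr, 2), classify(lr))                # about 4.95: weak evidence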
2.4 Misleading evidence
Misleading evidence, by definition, is strong evidence in favor of the incorrect hypothesis (e.g., observing LR = 47 when H0 is true or LR = 1/47 when H1 is true).37 We never know if observed evidence is misleading or not, but we can know the probabilities that a particular study will generate misleading evidence. They are represented, in general, by mis0 = P(LR ≥ k|H0) and mis1 = P(LR ≤ 1/k|H1) for some k > 1, typically 8 or 32. Here mis0 is the probability of observing misleading evidence under the null hypothesis and mis1 is the probability of observing misleading evidence under the alternative hypothesis. Both are EQ2s.

Misleading evidence is seldom observed in the likelihood paradigm. Under quite general conditions, mis0 and mis1 are each bounded by 1/k, the so-called universal bound, and their average is bounded by 1/(k + 1) [Royall, 1997; 2000]. The universal bound is a crude device for controlling the probability of observing misleading evidence; the actual probability is often much less. As the sample size grows, both probabilities of observing misleading evidence (i.e., mis0 and mis1) converge to zero.38 Under some mild parametric conditions and in moderate to large samples, the probability of observing misleading evidence is approximately Φ[−ln k/c − c/2] (≈ mis0), where c = |θ1 − θ0|√(nI(θ0)), Φ[·] is the standard normal cumulative distribution function, and I(θ0) is the Fisher information39 [Royall, 2000]. From this we get an asymptotic maximum40 probability of observing misleading evidence of only Φ[−√(2 ln k)] (≤ 1/[2k√(π ln k)]), which is typically much less than the universal bound, 1/k. Understanding the behavior of EQ2 is important in evaluating study designs. Technical extensions have recently become available [Royall, 2000; 2003; Blume, 2002; 2007].

EQ2 and EQ3 have different mathematical representations. In general, the third evidential quantities are P(H0|data) = [1 + kπ1/π0]^(−1) and P(H1|data) = [1 + π0/kπ1]^(−1), where π0 = P(H0), π1 = 1 − π0, and k > 1 is the observed likelihood ratio in support of H1 over H0. So, as the sample size increases, these probabilities are driven to zero or one by the likelihood ratio, which itself converges to 0 or ∞ in support of the correct (true) hypothesis. Thus, for a fixed prior probability, larger observed likelihood ratios are more reliable and the degree of reliability can always be improved by increasing the sample size.

37 Weak evidence — LRs between 1/k and k, or typically 1/8 and 8 — is not considered to be misleading by definition. This is because weak evidence is inconclusive and has little force. In practice, Likelihoodists tend to ignore the directionality of weak evidence, characterizing it as inconclusive. See Royall [1997] for more on this topic.
38 This is an interesting and useful property for EQ2. Contrast this with hypothesis testing, in which one of its EQ2s (the Type I error) remains constant regardless of the sample size. Thus errors are made even with an infinite sample size.
39 Under a normal model with known variance this approximation is exact.
40 The maximum is over all alternatives.
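These formulas are easy to evaluate. The sketch below (illustrative; a normal mean with σ = 1, so that I(θ0) = 1 and c = |θ1 − θ0|√n) compares the bump-function probability with the asymptotic maximum and the universal bound.

    from math import erf, log, sqrt, pi

    def Phi(z):                                  # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    def mis0(k, effect, n):
        c = effect * sqrt(n)                     # c = |theta1 - theta0| * sqrt(n I)
        return Phi(-log(k) / c - c / 2)

    k = 8
    print(mis0(k, effect=0.5, n=25))             # about 0.0187
    print(Phi(-sqrt(2 * log(k))))                # asymptotic maximum, about 0.021
    print(1 / (2 * k * sqrt(pi * log(k))))       # its analytic bound, about 0.024
    print(1 / k)                                 # the cruder universal bound, 0.125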
3 REEXAMINATION OF ACCUMULATING DATA ('MULTIPLE LOOKS')
Scenarios involving repeated examination of accumulating data (i.e., multiple looks) provide an excellent opportunity to illustrate the value of having an evidential framework. A common misconception is that there is no 'penalty' for multiple looks at data in the Likelihood paradigm.41 It is true that likelihood ratios (EQ1) are not modified by the number of looks at the data. But it is also true that there is a 'penalty'. The caveat is that the penalty applies to EQ2 and not to EQ1 as it would with p-values. The probability of observing misleading evidence, EQ2, is indeed affected by the number of examinations of the data. Every additional examination of the data increases EQ2. The stopping rule is indeed relevant for determining how often a study design will yield misleading evidence. But once data are collected, the stopping rule is irrelevant; it does not affect the measure of evidence nor the potential for observed evidence to be misleading (EQ3).

What many classical statisticians seem to find unsettling (or at least unfamiliar) about the likelihood paradigm is the separation of EQ1 from EQ2. EQ1 is the likelihood ratio; EQ2 is the probability that the likelihood ratio will be misleading. This is unfamiliar because the p-value is essentially used as both EQ1 and EQ2 indiscriminately. For example, consider two studies, one of which examines the data as it accumulates. Both studies happen to generate the same data set. Despite having identical data, they will report different p-values and claim to have different amounts of evidence. This is because the calculation of a p-value depends on the study design; it is calculated as if it were an EQ2, but interpreted as if it were an EQ1. The underlying problem is that just one number, the p-value, has to function as both EQ1 and EQ2 (and maybe even the EQ3). This problem is avoided when a well defined evidential framework exists. For example, under a likelihood approach these two studies would report the same likelihood ratio and therefore the same observed strength of evidence in the data. What is different is their study design; one study had a much larger chance of generating misleading evidence before any data were collected. But this probability is irrelevant now that data have been observed. What might have been is irrelevant. All that is relevant is what was observed (EQ1) and the propensity for the observed data to be misleading (EQ3).

Let's consider a numerical example to fix ideas. Suppose I design a study to collect evidence about the average change in systolic blood pressure due to a new medication. Assume that these changes are normally distributed with a standard deviation of σ = 16 mm Hg. My null hypothesis is that there is no change in mean blood pressure (H0: µ = 0) and my alternative hypothesis is that the mean change is 8 mm Hg or one-half of the population standard deviation (H1: µ = 8, or 0.5σ).42 With this information we can calculate how often certain study designs will generate misleading evidence in favor of the alternative hypothesis.

First, consider a fixed sample size design. The study enrolls 25 people, collecting their before and after blood pressure measurements, and the data are examined after all observations have been collected. The probability of observing misleading evidence of strength 8 or more in this design is only 0.0187, which decreases to 0.0042 for (misleading) likelihood ratios of 32 or more.43 In contrast, consider instead a 'truncated sequential' design, where the data are examined after every observation is collected and the study is stopped as soon as strong evidence for the alternative hypothesis is observed or when data on 25 people have been collected, whichever comes first. When the null hypothesis is true, the probability that this design generates misleading evidence of strength 8 or more is 0.0717, which decreases to 0.0123 for (misleading) likelihood ratios of 32 or more.44

Lastly, suppose it was possible to continue collecting data forever, so we could plan to collect observations until we observed evidence in support of the alternative hypothesis. This is a very biased design. Data are collected until we find strong evidence in support of the alternative hypothesis, while strong evidence in favor of the null hypothesis is ignored. Thus, this design provides the best chance of collecting misleading evidence, because even if the null hypothesis is true we continue sampling until we find misleading evidence. But even in this biased design the probability of observing misleading evidence is only 0.0934 (for evidence of strength 8 or more) and 0.0233 (for evidence of strength 32 or more).45 Notice that there is a very large probability that misleading evidence will never be generated. This is an important scientific safeguard; an investigator searching for evidence to support his pet hypothesis over the correct hypothesis is likely never to find such evidence when using the likelihood paradigm.

It should be clear that even within the Likelihood paradigm a price is exacted for each examination of the data. That price is the inflation in the probability of observing misleading evidence, EQ2. One important characteristic of this probability is that the amount by which it increases converges to zero as the sample size grows [Blume, 2008]. Thus the probability of generating misleading evidence remains bounded even with an infinite number of examinations of the data. Simply put, the chance of observing misleading evidence at a single point in time is less than the chance of observing misleading evidence at any point in time, although the latter remains bounded. More precisely we can write [Robbins, 1970]:46

P(LRn ≥ k|H0) < P(LRn ≥ k for any n = 1, 2, ...|H0) ≤ 1/k

As we have just seen, the probability of observing misleading evidence increases as the number of examinations increases. The chance of observing misleading evidence after collecting 25 participants is 0.0187, which increased to 0.0717 when the data were continuously monitored up to the 25 subjects, which increased to 0.0934 when the data were continually monitored until misleading evidence was obtained (i.e., there was no limit on the sample size). While the behavior of EQ2 is interesting and clearly depends on the study design, this behavior is independent of the observed EQ1, which does not depend on the study design.

41 P-values are penalized (i.e., discounted/inflated) for each look at accumulating data and for each planned look at the data that has not yet been executed. Likelihood ratios are not.
42 σ is the standard deviation, µ is the mean, and H0 and H1 are the null and alternative hypotheses. For convenience I expressed the alternative hypothesis in units of standard deviations. This makes the upcoming probability calculations simpler and adds to the generality of this example in the sense that any dependence on the stated value of the standard deviation is removed.
43 This probability is simply Φ[−ln k/c − c/2], noted in an earlier section, here with c = √n(µ1 − µ0)/σ = 2.5 and k = 8 or 32.
44 See [Blume, 2008] for the formula for this calculation.
45 This probability is exp(−0.583c)/k with c = (µ1 − µ0)/σ = 0.5 and k = 8 or 32 [Blume, 2002; 2007].
46 A general approximation is P(LRn ≥ k for any n = 1, 2, ...|H0) ≈ exp(−0.583c)/k [Blume, 2002; 2007].
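The inflation of EQ2 under repeated looks can also be checked by simulation. The Monte Carlo sketch below (illustrative only; it mirrors the blood-pressure example with µ0 = 0, µ1 = 8, σ = 16, n up to 25 and k = 8, with data generated under the null) reproduces the fixed-design and truncated-sequential probabilities to within simulation error.

    import math, random

    def misleading_rate(sequential, trials=100_000, n=25, k=8,
                        mu0=0.0, mu1=8.0, sigma=16.0):
        log_k, hits = math.log(k), 0
        for _ in range(trials):
            loglr, misled = 0.0, False
            for _ in range(n):
                x = random.gauss(mu0, sigma)      # the null hypothesis is true
                loglr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
                if sequential and loglr >= log_k:
                    misled = True                 # stop at first strong evidence
                    break
            if misled or (not sequential and loglr >= log_k):
                hits += 1
        return hits / trials

    print(misleading_rate(sequential=False))      # about 0.019 (fixed design)
    print(misleading_rate(sequential=True))       # about 0.072 (25 looks)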
4 MEASURING EVIDENCE ABOUT SEVERAL ENDPOINTS SIMULTANEOUSLY ('MULTIPLE COMPARISONS')
An evidential framework is also helpful when dealing with multiple comparisons, for reasons similar to those just discussed in the context of reexamination of accumulating data. Consider a study in which 4 different measurements or endpoints are collected from each subject before and after a certain medical intervention. The goal of the study is to see if any of these endpoints change after the medical intervention. Our study will collect data from 6 participants.47 For simplicity and without loss of generality, we will also assume that the 4 endpoints are independent and normally distributed with a known variance.48 The null hypothesis is that of no change, H0: µ = 0. Because our sample size is small, we set our alternative to be a change of one standard deviation, H1: µ = µ1 = σ (i.e., (µ1 − µ0)/σ = 1).49 The probabilities of observing misleading evidence are mis0 = Φ[−ln k/√n − √n/2] and mis1 = Φ[−ln k/√n − √n/2] with n = 6. When k = 8, these probabilities are both only 2%. The family-wise50 probability — the probability of observing misleading evidence on at least one endpoint — is 7.8% (= 1 − (1 − 0.02)⁴) under both the null and alternative hypotheses. When k = 20, the probability of observing misleading evidence on a single endpoint is 0.72% and the family-wise probability is 2.85%. We can see from this example that the sample size plays an important role. With n = 12 observations and k = 8, the probabilities of observing misleading evidence are only 1%, and the family-wise rate drops from 7.8% to 3.9%.

Note that in the likelihood paradigm we do not choose a cutoff for k to denote which likelihood ratios are significant. Likelihood ratios are descriptive. If we want to control the probability of observing misleading evidence, then this is done entirely through the sample size. Also, it should be clear that the probability of observing misleading evidence on at least one endpoint will always increase as the number of endpoints increases. However, this probability is also controlled through the sample size and can be driven to zero with a large enough sample.51

Likelihood is different from hypothesis testing in that there are effectively three evidence regions: strong evidence for the null over the alternative, strong evidence for the alternative over the null, and weak evidence supporting either hypothesis over the other.52 In this example, there is a 33% chance of observing weak evidence.53 Strong evidence, when it is observed, tends to be reliable, but one-third of the time we will be left with weak, inconclusive evidence. In contrast, hypothesis testing has only two zones: ‘reject H0’ and ‘fail to reject H0’.54

For comparison purposes, it is instructive to consider what happens when we handicap the likelihood approach by removing the weak evidence zone. We do this by using k = 1, so that any amount of evidence (not just strong evidence) is potentially misleading. When k = 1, the probabilities are symmetrical and mis0 = mis1 = Φ[−√n/2]. The two corresponding design probabilities (e.g., the probability of observing any evidence in support of the alternative when the null is true) increase from 2% to 11% and the family-wise rates increase from 7.8% to 37%. This happens because weak evidence is naturally less reliable.55

Let's consider how this situation would be handled with a hypothesis test. The Type I and II errors of a hypothesis test are α and β = Φ[Z1−α − √n]. With a one-sided Type I error rate of 5% and 6 observations, we get a Type II error rate of 21%.

47 I picked this for ease of calculation. Small sample sizes like this are, unfortunately, not unusual in genomic or proteomic studies, although the number of genes or proteins being examined is often much larger than 4.
48 These assumptions allow us to focus on the conceptual underpinnings of the problem instead of its mathematics. No generality is lost.
49 In the example from the previous section the alternative was set to 1/2 (one-half of a standard deviation).
50 The family-wise error rate is defined as the probability of observing misleading evidence on at least one endpoint.
51 This is not so with hypothesis testing, as we will see later.
52 This inferential refinement is important in many respects and it should not be overlooked; it is the weak evidence that is most often misleading and should not be taken too seriously.
53 It is the same under either hypothesis.
54 Allowing for three zones is a major plus in my opinion. It completely resolves the asymmetry in hypothesis testing and significance testing that prevents them from generating evidence in support of the null hypothesis.
55 Judging from this, it seems best to call weak evidence inconclusive rather than overinterpreting it in practice. However, the upcoming comparison to hypothesis testing is insightful and worth pondering for academic purposes.
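A short computation, again my own sketch using the formulas just quoted, reproduces the per-endpoint and family-wise probabilities for this example.

```python
# Sketch: per-endpoint and family-wise probabilities of misleading evidence
# for N = 4 independent endpoints and a one-SD alternative.
from math import erf, log, sqrt

def phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mis(n, k):
    return phi(-log(k) / sqrt(n) - sqrt(n) / 2.0)   # mis0 = mis1 here

for n, k in [(6, 8), (6, 20), (12, 8), (6, 1)]:
    p = mis(n, k)
    fw = 1.0 - (1.0 - p) ** 4
    print(f"n = {n:2d}, k = {k:2d}: per endpoint {p:.4f}, family-wise {fw:.4f}")
# Reproduces ~2% and 7.8% (n=6, k=8); 0.72% and 2.85% (k=20); ~1% and 3.9%
# (n=12, k=8); and 11% and 37% for the no-weak-evidence case (k=1).
```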
But when there are 4 endpoints with this identical structure, a Bonferroni adjustment is required to control the overall Type I error.56 This adjustment results in a Type I error rate of 1.25% and a Type II error rate of 42% for each endpoint. The family-wise Type I error rate (i.e., the probability of making at least one Type I error) is now controlled at 5% (= 1 − 0.9875⁴), but the family-wise Type II error rate balloons to 89%. Remember that the comparable family-wise rates for likelihood were both 37%. Certainly, 37% is far from 5%, but it is even further from 89%. For a small increase in one rate there is a large drop in the other, and this occurs without the complexity of post-hoc adjustments to the p-value. Note that the adjusted error rates of hypothesis testing now depend on how many other endpoints are being considered in the study. Thus, changing the number of endpoints in the study affects the accuracy (i.e., EQ2) with which you can test other endpoints. In contrast, the likelihood paradigm leaves the chance of observing misleading evidence (EQ2) on a single endpoint the same regardless of how many other endpoints are being considered. A price is still paid for multiple comparisons (the family-wise probabilities increase), but each endpoint remains consistent within itself because there is a clear evidential framework.

One way to balance the tradeoff between the different error rates is to use a metric like the average error rate. The average error rate even has a nice interpretation — as the probability of making either error — if we are willing to assume that the null and alternative hypotheses are equally likely.57 With likelihood, the average rate increases from 2% to 11% when we eliminate the weak evidence region. This yields a family-wise average rate of 37%, which is the probability of making at least one error, in any direction, over all the endpoints. In comparison, the hypothesis test has an average error rate of 13% for a single endpoint, which increases to 22% when that endpoint is properly adjusted for the multiple comparisons, and this yields an average family-wise error rate of 62% (= 1 − (1 − 0.22)⁴). This simple example is a good illustration of the general state of nature; there is no reason to assume that a lack of adjustment necessitates an uncontrollable or outrageous increase in observing misleading evidence. In fact, the worst case average rate at which likelihood ratios are misleading is 40% less than the same rate for a hypothesis test that adjusts for the multiple comparisons. Perhaps more importantly, the likelihood rates can always be driven to zero by increasing the sample size (unlike the Type I error), so a large enough sample size virtually guarantees there will not be any misleading evidence, regardless of the number of comparisons.

Figure 1 displays the family-wise average error rate for 1 and 4 endpoints under both likelihood and hypothesis testing (with multiple endpoints and a Bonferroni correction).

56 Adjustments other than Bonferroni are available, but this is the most common. The type of adjustment is irrelevant to the thrust of this example in any case. All adjustments simply trade Type II errors for Type I errors.
57 These two types of errors can be weighted unequally in the likelihood paradigm when it makes sense to do so. But this is beyond the scope of this paper.
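For comparison, this sketch (assumptions mine: a one-sided normal test of a one-SD effect) computes the hypothesis-testing rates cited above, with and without the Bonferroni adjustment.

```python
# Sketch: Type I/II and average error rates for a one-sided test of a one-SD
# effect, beta = Phi[z_{1-alpha} - sqrt(n)], with a Bonferroni adjustment
# across N = 4 endpoints.
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def rates(alpha, n):
    z = nd.inv_cdf(1.0 - alpha)          # critical value z_{1-alpha}
    beta = nd.cdf(z - sqrt(n))           # Type II error for a one-SD effect
    return alpha, beta, (alpha + beta) / 2.0

n, N = 6, 4
for label, alpha in [("unadjusted", 0.05), ("Bonferroni", 0.05 / N)]:
    a, b, avg = rates(alpha, n)
    fw_avg = 1.0 - (1.0 - avg) ** N
    print(f"{label}: alpha={a:.4f} beta={b:.3f} "
          f"avg={avg:.3f} family-wise avg={fw_avg:.3f}")
# Reproduces (5%, 21%, 13%) and (1.25%, 42%, 22%), with a family-wise average
# error rate of about 62% after adjustment.
```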
[Figure 1 here. Legend: Likelihood; Hyp. Test; Hyp. Test w/ Bonf. Adj. Vertical axis: family-wise average error rate (probability of at least one error), 0.0 to 0.5; curves shown for one endpoint and four endpoints.]
Notice that for every sample size, the average error rate is smaller under likelihood (when α = β in hypothesis testing the average error rates are equal). Notice also that the likelihood average error rate (solid lines) converges to zero, rather than to some fixed threshold, and hence can be driven as low as desired by increasing the sample size.
Figure 1. Average ‘error’ rates and family-wise average ‘error’ rates.

Figure 2 displays the individual rates themselves. The single solid line represents both likelihood rates. Unlike hypothesis testing, how often misleading evidence is observed on a single endpoint does not depend on the total number of endpoints. For example, consider Figure 2 at 11 observations, where, for a single endpoint, each of the design probabilities is 5% (all dashed and dotted lines cross at 5% with 11 observations). However, the hypothesis testing rates change when the three other endpoints are considered. What happened is this: the Type I error rate for each endpoint decreased to 1.25% and the Type II error rate increased to 14%. As a result, the new family-wise average error rate is 27% (the family-wise Type II rate is 45% and the family-wise Type I rate is 5%) instead of 19% (= 1 − 0.95⁴) for likelihood, where the individual probabilities remain at 5% regardless of the number of endpoints.
[Figure 2 here. Legend: Likelihood (1 & 4 endpoints; both errors); Hyp. Test Rates (1 endpoint); Hyp. Test Rates (4 endpoints). Vertical axis: individual (endpoint-specific) error rates, 0.0 to 0.5, with separate Type I and Type II error curves; horizontal axis: sample size, 0 to 30.]
Figure 2. Type I and II error rates and probabilities of observing any evidence in the wrong direction (i.e., k = 1) for likelihood.

Thus, even in the likelihood paradigm a traditional frequentist price is exacted for having multiple endpoints in a single study: the probability of observing misleading evidence on at least one endpoint increases with each additional endpoint. The important difference is that this price is not exacted on EQ1, so the evaluation of the evidence from any individual endpoint does not depend on how many other endpoints are under consideration.58 Moreover, the probability of observing misleading evidence on any single endpoint (EQ2) does not change with the number of endpoints (unlike in hypothesis testing), thus conferring some internal consistency.
4.1 The potential for observed data to be misleading
The focus to this point has been on the probability of generating misleading evidence (EQ2) and how that varies with the addition of endpoints to the study in question.

58 The only minor exception to this is when the underlying statistical model allows endpoints to be correlated, but this is not the dependence that we are trying to avoid.
While EQ2 is an important piece of the puzzle, it is irrelevant once observations are collected. At the end of a study, EQ2 is often confused with the probability that the observed evidence is misleading (EQ3). Once we collect data, the design probability is irrelevant — the data are either misleading or not, and we do not know which. So what we really want to know is how likely it is that the observed data are misleading. Unlike EQ2, the probability that the observed evidence is misleading, EQ3, does not depend on the study design. An example is helpful to fix ideas, and I'll continue with the multiple comparisons example from the previous section. In that example, we planned to collect six observations under one of two different designs: Study A has just a single endpoint, while Study B has three additional endpoints (4 total). Table 2 displays their design probabilities, which have already been discussed at length.

Table 2. Two study designs and their performance characteristics. Table entries are percentages. *Based on the average error rate.
                                One endpoint (Study A)    Four endpoints (Study B)
                                α(k)   β(k)   Overall*    α(k)   β(k)   Overall*
Likelihood with k = 8
  Primary endpoint                2      2      2           2      2      2
  Any endpoint                    -      -      -           7.8    7.8    7.8
Likelihood with k = 1
  Primary endpoint               11     11     11          11     11     11
  Any endpoint                    -      -      -          37     37     37
Hypothesis testing
  Primary endpoint                5     21     13           1.25  42     22
  Any endpoint                    -      -      -           5     89     62
Now suppose the observed data yield a mean change on the first endpoint that was 1.01 standard deviations from zero. This evidence favors the alternative hypothesis by a factor of 21.34 (= L(µ1)/L(µ0)) regardless of which study generated the data (EQ1 does not vary by study). But the correct p-value is different depending on the design: p = 0.013 under Study A and p = 0.052 (= 1 − (1 − 0.013)⁴) under Study B.59 This inevitably raises the following question: do these data represent stronger evidence when they come from Study A?

59 Adjustment of p-values according to [Wright, 1992].
Or, when these data come from Study A, are they not “less likely to be misleading”, “more reliable” in some sense, or do they not warrant more “confidence”? The answer to both of these questions is a resounding no, because EQ3 does not vary by study. These data are equivalent as evidence about the unknown mean regardless of the study design; they are no more likely to be misleading under one design or the other. Here is why: these data are misleading if and only if the null hypothesis is true, and here P(H0 | data) is the same regardless of the design. An application of Bayes' theorem yields P(H0 | data) = 1/[1 + 21.34] = 0.045, which does not differ by study design.60 There is only a 4.5% chance that these data are misleading, and this probability does not change whether there were one, four, or two million endpoints. Of course, changing the prior will change the degree to which we think these observed data are misleading. But EQ3 will not vary by study design unless the priors do. Thus we can use EQ3 to help assess the reliability of the observed evidence for a given prior, or as a sensitivity analysis across several priors.

60 Assumes a non-informative prior, i.e., the hypotheses were equally likely before any data were collected.
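The arithmetic behind this example is compact enough to verify directly. The sketch below is my own; the two-sided p-value and the equal-prior assumption follow the text and footnote 60. It computes the likelihood ratio, the design-dependent adjusted p-value, and the design-independent posterior.

```python
# Sketch: observed evidence for the multiple comparisons example.
# xbar is 1.01 SDs from zero with n = 6; H0: mu = 0 vs H1: mu = 1 (SD units).
from math import exp, sqrt
from statistics import NormalDist

n, xbar, mu0, mu1 = 6, 1.01, 0.0, 1.0
lr = exp(n * (mu1 - mu0) * (xbar - (mu0 + mu1) / 2.0))  # L(mu1)/L(mu0) ~ 21.3
p_raw = 2.0 * (1.0 - NormalDist().cdf(xbar * sqrt(n)))  # Study A: p ~ 0.013
p_adj = 1.0 - (1.0 - p_raw) ** 4                        # Study B: p ~ 0.052
post_h0 = 1.0 / (1.0 + lr)                              # ~ 0.045, equal priors

print(f"LR = {lr:.2f}, p(A) = {p_raw:.3f}, p(B) = {p_adj:.3f}, "
      f"P(H0|data) = {post_h0:.3f}")
```

The p-value shifts with the design while the posterior does not, which is exactly the contrast the text draws between design probabilities (EQ2) and the reliability of the observed evidence (EQ3).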
5 COMMENTS
The likelihood approach is used to measure the strength of evidence in observed data. It avoids the pitfalls of other statistical paradigms because it has a well-defined evidential framework. The likelihood paradigm is often viewed as the common ground between frequentists and Bayesians when it comes to measuring the strength of statistical evidence. This is because the likelihood function is often the only thing a frequentist and a Bayesian can agree upon. Another reason is that likelihood ratios retain the desirable properties of both paradigms (irrelevance of the sample space, good performance probabilities) while shedding the undesirable ones (dependence on prior distributions, ad hoc adjustments to control error probabilities).

BIBLIOGRAPHY

[Barnard, 1949] G. A. Barnard. Statistical inference. Journal of the Royal Statistical Society, Series B, 11: 115-149, 1949.
[Birnbaum, 1962] A. Birnbaum. On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 57: 269-306, 1962.
[Blume, 2002] J. D. Blume. Likelihood methods for measuring statistical evidence. Statistics in Medicine, 21(17): 2563-2599, 2002.
[Blume and Peipert, 2003] J. D. Blume and J. F. Peipert. What your statistician never told you about p-values. Journal of the American Association of Gynecologic Laparoscopists, 10(4): 439-444, 2003.
[Blume, 2005] J. D. Blume. How to choose a working model for measuring the statistical evidence about a regression parameter. International Statistical Review, 73(2): 351-363, 2005.
[Blume et al., 2007] J. D. Blume, L. Su, L. Acosta, R. M. Olveda, and S. T. McGarvey. Statistical evidence for GLM regression parameters: a robust likelihood approach. Statistics in Medicine, 26(15): 2919-2936, 2007.
[Blume, 2008] J. D. Blume. How often likelihood ratios are misleading in sequential trials. Communications in Statistics - Theory and Methods, 38(8): 1193-1206, 2008.
[Cornfield, 1966] J. Cornfield. Sequential trials, sequential analysis and the likelihood principle. The American Statistician, 20(2): 18-23, 1966.
[Edwards, 1971] A. W. F. Edwards. Likelihood. Cambridge University Press, London, 1971.
[Fisher, 1959] R. A. Fisher. Statistical Methods and Scientific Inference, 2nd ed. Hafner, New York, 1959.
[Goodman and Royall, 1988] S. N. Goodman and R. M. Royall. Evidence and scientific research. American Journal of Public Health, 78(12): 1568-1574, 1988.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference. Cambridge University Press, New York, 1965.
[Royall, 1997] R. M. Royall. Statistical Evidence. Chapman & Hall, London, 1997.
[Royall, 2000] R. M. Royall. On the probability of observing misleading statistical evidence (with discussion). Journal of the American Statistical Association, 95(451): 760-767, 2000.
[Royall and Tsou, 2003] R. M. Royall and T. S. Tsou. Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. Journal of the Royal Statistical Society, Series B, 65(2): 391-404, 2003.
[Tsou and Royall, 1995] T. S. Tsou and R. M. Royall. Robust likelihoods. Journal of the American Statistical Association, 90(419): 316-320, 1995.
[Savage, 1964] L. J. Savage. The foundations of statistics reconsidered. In H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability. John Wiley and Sons, New York, 1964.
[Strug and Hodge, 2006a] L. Strug and S. E. Hodge. An alternative foundation for the planning and evaluation of linkage analysis: 1. Decoupling ‘error probabilities’ from ‘measures of evidence’. Human Heredity, 61: 166-188, 2006.
[Strug and Hodge, 2006b] L. Strug and S. E. Hodge. An alternative foundation for the planning and evaluation of linkage analysis: 2. Implications for multiple test adjustments. Human Heredity, 61: 200-209, 2006.
[van der Tweel, 2005] I. van der Tweel. Repeated looks at accumulating data: to correct or not to correct? European Journal of Epidemiology, 20: 205-211, 2005.
[Wright, 1992] S. P. Wright. Adjusted p-values for simultaneous inference. Biometrics, 48: 1005-1013, 1992.
EVIDENCE, EVIDENCE FUNCTIONS, AND ERROR PROBABILITIES

Mark L. Taper and Subhash R. Lele
1 INTRODUCTION
In a recent paper Malcolm Forster has stated a common understanding regarding modern statistics:

Contemporary statistics is divided into three camps; classical Neyman-Pearson statistics (see [Mayo, 1996] for a recent defense), Bayesianism (e.g., [Jeffreys, 1961; Savage, 1976; Berger, 1985; Berger and Wolpert, 1988]), and third, but not last, Likelihoodism (e.g., [Hacking, 1965; Edwards, 1987; Royall, 1997]). [Forster, 2006]

We agree with this division of statistics into three camps, but feel that Likelihoodism is only an important special case of what we would like to call evidential statistics. In the sequel, we will try to justify our expansion of evidential statistics beyond the likelihood paradigm and to relate evidentialism to classical epistemology1 and to classical statistics.

For at least the last three quarters of a century a fierce battle has raged regarding foundations for statistical methods. Statistical methods are epistemological methods, that is, methods for gaining knowledge. What needs to be remembered is that epistemological methods are technological devices — tools. One does not ask if a tool is true or false, or right or wrong. One judges a tool as effective or ineffective for the task to which it will be applied. In this article, we are interested in statistics as a tool for the development of scientific knowledge. We develop our desiderata for knowledge developing tools in science and show how far the evidential statistical paradigm goes towards meeting these objectives. We also relate evidential statistics to the competing paradigms of Bayesian statistics and error statistics.

Richard Royall [1997; 2004] focuses attention on three kinds of questions: “What should I believe?”, “What should I do?”, and “How should I interpret this body of observations as evidence?” Royall says that these questions “define three distinct problem areas of statistics.”

1 We have tried to frame our discussion for a philosophical audience, but we are scientists and statisticians. Omission of any particular citation to the philosophical literature will most likely represent our ignorance, not a judgment regarding the importance of the reference.
But are they the right questions for science? Science and scientists do many things. Individual scientists have personal beliefs regarding the theories and even the observations of science. And yes, these personal beliefs are critical for progress in science. Without as yet unjustified belief, what scientist would stick his or her neck out to drive a research program past the edge of the known? In a more applied context, scientists are often called on to make or advise on decisions large and small. It is likely that this decision making function pays the bills for the majority of scientists. But perhaps the most important activity that scientists aspire to is augmenting humanity's accumulated store of scientific knowledge. It is in this activity that we believe Royall's third question is critical.

Our thinking regarding the importance and nature of statistical evidence develops from our understanding (however crude) of a number of precepts drawn from the philosophy of science. We share the view, widely held since the eighteenth century, that science is a collective process carried out by vast numbers of researchers over long stretches of time [Nisbet, 1980]. Personally, we hold the view that models carry the meaning in science [Frigg, 2006; Giere, 2004; 2008]. This is, perhaps, a radical view, but an interest in statistical evidence can be motivated by more commonplace beliefs regarding models, such as that they represent reality [Cartwright, 1999; Giere, 1988; 1999; 2004; Hughes, 1997; Morgan, 1999; Psillos, 1999; Suppe, 1989; van Fraassen, 1980; 2002] or serve as tools for learning about reality [Giere, 1999; Morgan, 1999]. We are strongly skeptical about the “truth” of any models or theories proposable by scientists [Miller, 2000]. We mean by this that although we believe there is a reality, which we refer to as “truth”, no humanly constructed model or theory completely captures it, and thus all models are necessarily false. Nevertheless, some models are better approximations of reality than other models [Lindsay, 2004], and some models are even useful [Box, 1979]. In the light of these basal concepts, we believe that growth in scientific knowledge can be seen as the continual replacement of current models with models that approximate reality more closely. Consequently, the question “what methods to use when selecting amongst models?” is perhaps the most critical one in developing a scientific method.

Undoubtedly the works that most strongly influenced 20th century scientists in their model choices were Karl Popper's 1934 (German) and 1959 (English) versions of his book Logic of Scientific Discovery. Nobel Prize winning scientist Sir Peter Medawar called this book “one of the most important documents of the twentieth century.” Popper took the fallacy of affirming the consequent2 seriously, stating that the fundamental principle of science is that hypotheses and theories can never be proved but only disproved. Hypotheses and theories are compared by comparing their deductive consequences with empirical observations. This hypothetico-deductive framework for scientific investigation was popularized in the scientific community by Platt's [1964] article on Strong Inference. Platt's important contribution was his emphasis on multiple competing hypotheses.

2 The logical fallacy that is made according to the following faulty reasoning: 1) If A then B; 2) B; 3) Therefore A.
Another difficulty with the falsificationist approach is the fact that not only can you not prove hypotheses, you cannot disprove them either. This was recognized by Quine [1951]; his discussion of the under-determination of theory by data concludes that a hypothesis3 is only testable as a bundle with all of the background statements on which it depends. Another block to disproving hypotheses is the modern realization that the world and our observation of it are awash with stochastic influences, including process variation and measurement error. When random effects are taken into consideration, we frequently find that no data set is impossible under a model, only highly improbable. Therefore, “truth” is inaccessible to scientists, either because the models required to represent “truth” are complex beyond comprehension, or because so many elements are involved in a theory that might represent “truth” fully that an infinite number of experimental manipulations would be required to test such a theory. Finally, even if two full theories could be formulated and probed experimentally, it is not likely that either will be unequivocally excluded, because in a stochastic world all outcomes are likely to be possible even if unlikely.

What are we as scientists to do? We do not wish to lust after an unattainable goal; we are not so adolescent. Fortunately, there are several substitute goals that may be attainable. First, even if we can't make true statements about reality, it would be nice to be able to make true statements about the state of our knowledge of reality. Second, if our models are only approximations, it would be nice to be able to assess how close to truth they are [Forster, 2002]. Popper [1963] was the first to realize that although all theories are false, some might be more truthlike than others, and he proposed his concept of verisimilitude to measure this property. Popper's exact formulation was quickly discredited [Harris, 1974; Miller, 1974; Tichy, 1974], but the idea of verisimilitude continues to drive much thought in the philosophy of science (see [Niiniluoto, 1998; Zwart, 2001; Oddie, 2007] for reviews). The results of this research have been mixed [Gemes, 2007]. The difficulty for the verisimilitude project is that, philosophically, theories are considered as sets of linguistic propositions. Ranking the overall truthlikeness of different theories on the basis of the truth values and content of their comprised propositions is quite arbitrary. Is theory A, with only one false logical consequence, truer than theory B, with several false consequences? Does it make a difference if the false proposition in A is really important, and the false propositions in B are trivial? Fortunately, as Popper noted [1976], verisimilitude is possible with numerical models, where the distance of a model to truth can be represented by a single value.

We take evidence to be a three-place relation between data and two alternate models.4 Evidence quantifies the relative support for one model over the other and is a data based estimate of the relative distance from each of the models to reality.

3 A scientific hypothesis is a conjecture as to how the world is or operates.
4 A reviewer has suggested that background information may be a necessary fourth part, but background information will enter the formalization either as part of the data, or as part of one or more of the models.
Under this conception, to speak of evidence for a model does not make sense. This, then, is what we call the evidential approach: to compare the truthlikeness of numerical models. The statistical evidence measures the differences of models from truth in a single dimension and consequently may flatten some of the richness of a linguistic theory. While statistical evidence is perhaps not as ambitious as Popper's verisimilitude, it is achievable and useful. We term the quantitative measure of the relative distance of models to truth an evidence function [Lele, 2004; Taper and Lele, 2004]. There will be no unique measure of the divergence between models and truth, so a theory of evidence should guide the choice of measures in a useful fashion. To facilitate the use of statistical evidence functions as a tool for the accumulation of scientific knowledge, we believe that a theory of evidence should have the following desiderata:
D1) Evidence should be a data based estimate of the relative distance between two models and reality.

D2) Evidence should be a continuous function of data. This means that there is no threshold that must be passed before something is counted as evidence.

D3) The reliability of evidential statements should be quantifiable.

D4) Evidence should be public, not private or personal.

D5) Evidence should be portable, that is, it should be transferable from person to person.

D6) Evidence should be accumulable: if two data sets relate the same pair of models, then the evidence should be combinable in some fashion, and any evidence collected should bear on any future inferences regarding the models in question.

D7) Evidence should not depend on the personal idiosyncrasies of model formulation. By this we mean that evidence functions should be both scale and transformation invariant.5
We do not claim that inferential methods lacking some of these characteristics cannot be useful. Nor do we claim that evidential statistics is fully formulated. Much work needs to be done, but these are the characteristics that we hope a mature theory of evidence will contain.

Glossed over in Platt is the question of what to do if all of your hypotheses are refuted. Popper acknowledges that even if it is refuted, scientists need to keep their best hypothesis until a superior one is found [Popper, 1963].

5 An example of scale invariance is that whether one measures elevation in feet or meters should not influence the evidence that one mountain is higher than another. An example of transformation invariance is that it should not matter in conclusions regarding spread whether spread is measured as a standard deviation or as a variance.
Once we recognize that scientists are unwilling to discard all hypotheses [Thompson, 2007], it is easy to recognize that the falsificationist paradigm is really a paradigm of relative confirmation — the hypothesis least refuted is most confirmed. Thus, the practice of science has been cryptically evidential for at least half a century. We believe that it is important to make this practice more explicit.
2 QUANTIFYING EVIDENCE, LIKELIHOOD RATIOS, AND EVIDENCE FUNCTIONS
The issue of quantifying evidence in the data has always vexed statisticians. The introduction of the concept of the likelihood function6 [Fisher, 1912; 1921; 1922] was a major advance in this direction. However, how should one use the likelihood function? The main uses of the likelihood function have been in terms of point estimation of the parameters of the statistical model and testing of statistical hypotheses7 [Neyman and Pearson, 1933]. Neyman and his associates couched statistical inference as a dichotomous decision making problem (violating D2), whereas Fisher seems to have been much more inclined to look at statistical inference in terms of quantification of evidence for competing models.8

The use of significance tests and the associated p-values9 as a measure of evidence is most popular in the applied sciences. However, their use is not without controversy [Royall, 1986]. The main problem with the use of p-values as a measure of evidence is that they are not comparative measures (violating D1). There is no explicit alternative against which the hypothesis of interest is being compared [Royall, 1992]. Similarly, the use of Bayesian posterior probabilities as a measure of evidence is problematic, leading to a number of contradictions. Posterior probabilities are not invariant to parameterization, making them an unsatisfactory measure of evidence (violating D7). Many Bayesian formulations involve subjective prior probabilities, thereby violating D4.

The likelihood ratio (LR) is a measure of evidence with a long history. LRs explicitly compare the relative support for two models given a set of observations (D1) and are invariant to parameter transformation and scale change (D7). Barnard [1949] is one of the earliest explicit expositions. Hacking [1965] made it even more explicit in his statement of the law of likelihood.10 Edwards [1992] is another exposition that promoted the law of likelihood and the likelihood function. Royall [1997] perhaps makes the best pedagogic case for the law of likelihood and expands its scope in many significant ways. In particular, his introduction of error probabilities and their use in designing experiments is extremely important. Further, his promotion and justification of profile likelihood and robust likelihood as measures of evidence is a significant development. Although in the first part of his book Royall strongly emphasizes the likelihood principle11 (which is different from the law of likelihood), his use of ad hoc adjustments to likelihoods, such as the profile likelihood in the presence of nuisance parameters, clearly violates the likelihood principle, because considerations besides just the likelihoods influence the inference [Fraser, 1963; Berger and Wolpert, 1988]. Similarly, the consideration of error probabilities depends on the sample space, and hence they violate the likelihood principle as well [Boik, 2004]. It is clear that error probabilities, if taken as part of the evidence evaluation, violate the likelihood principle. We hasten to point out that Royall does not suggest using error probabilities as part of evidence. We agree that error probabilities are not evidence, but feel that they can play an important role in inference. We expand on this discussion in the next section.

Royall and others take the law of likelihood as a given and then try to justify why it makes sense in terms of adherence to the likelihood principle, the universal bound on the probability of misleading evidence, and other intuitive criteria. On the other hand, when faced with the problem of nuisance parameters12 or the desire for model robustness,13 they propose the use of other ad hoc methods, but their justifications do not carry through. Nevertheless, in many practical situations one may not want to specify the model completely and may instead use methods based on mean and variance function specification alone, such as generalized linear models [McCullogh and Nelder, 1989]. A further problem is that the error probabilities are computed assuming one of the hypotheses is in fact true (violating D1). To circumvent these issues and to try to give a fundamental justification for the use of the likelihood ratio as a measure of evidence, Lele [2004] introduced the concept of evidence functions.

The first question that Lele [2004] poses is: what happens to the likelihood ratio when the true distribution is different from either of the competing hypotheses? A simple application of the law of large numbers shows that as the sample size increases, the log-likelihood ratio converges to the difference between the Kullback-Leibler divergence14 between the true distribution and hypothesis A and the Kullback-Leibler divergence between the true distribution and hypothesis B. If hypothesis A is closer to the truth than hypothesis B is, the likelihood ratio leads us to hypothesis A. Thus, it follows that strength of evidence is a relative measure that compares distances between the true model and the competing models [Lele, 2004]. Immediate consequences of this observation are the questions: 1) Can we use divergence measures other than Kullback-Leibler to measure strength of evidence? 2) Is there anything special about the Kullback-Leibler divergence? A study of these two questions led Lele [2004] to the following conclusions: First, different divergence-based quantifications may be compared in terms of the rate at which the probability of strong evidence converges to one. And second, the Kullback-Leibler divergence has the best rate of convergence among all other measures of evidence. This result holds provided full specification of the probabilistic model is available, there are no outliers in the data, and the true model is one of the competing hypotheses. However, one can make the quantification of evidence robust against outliers, an important practical consideration, by using divergences other than the Kullback-Leibler divergence.15 Further, one can quantify strength of evidence in situations where one may not want to specify the full probabilistic model but may be willing to specify only mean and variance functions by using Jeffrey's divergence measure. One may also use an empirical likelihood ratio or other divergence measures based on an estimating function. Thus, one can justify the use of a variety of modified forms of the likelihood ratio, such as conditional likelihood ratios, profile likelihood ratios, and composite likelihood ratios, as measures of evidence, because they correspond to some form of relative divergence from “truth”.

Other notable conclusions that follow from the generalization of the law of likelihood in terms of divergence measures are: 1) The design of the experiment and stopping rules do matter in the quantification of evidence if divergences other than the Kullback-Leibler divergence are used [Lele, 2004, discussion]. This goes against the pre-eminence of the likelihood principle in the development of Royall [1997], a pre-eminence criticized by Cox [2004]. And 2) the concept of error probabilities needs to be extended to allow for the fact that the class of hypothesized models seldom contains the true model. In the following, we suggest how the second issue could be addressed. This also leads to quantifying post-data reliability measures for the strength of evidence.

6 The likelihood is numerically the probability of the observations (data) under a model and is considered a function of the parameters, that is: L(θ; x) = f(x; θ). Here L is the likelihood function, θ is the parameter or parameter vector, x is the datum or data vector, and f is a probability distribution function. The likelihood is not a probability, as it does not integrate to one over the parameter space.
7 A statistical hypothesis is a conjecture that the data are drawn from a specified probability distribution. Operationally, one tests scientific hypotheses by translating them into statistical hypotheses and then testing the statistical hypotheses (see [Pickett et al., 1994] for a discussion).
8 “A likelihood based inference is used to analyze, summarize and communicate statistical evidence. . . ” [Fisher, 1973, page 75].
9 The p-value is the probability (under a null hypothesis) of observing a result as or more extreme than the observed result.
10 According to the law of likelihood, model 1 is supported over model 2 if, based on the same data, the likelihood of model 1 is greater than the likelihood of model 2.
11 The likelihood principle states that all evidential meaning in the data is contained in the likelihood function.
12 Nuisance parameters are parameters that must be included in the model to be scientifically realistic, but are not themselves the entities on which inferences are desired. Nuisance parameters are discussed in more detail in the section on multiplicities.
13 Model robust techniques are designed so that inference on the model elements of interest can be made even if nuisance portions of the model may be somewhat incorrect.
14 The Kullback-Leibler divergence is one of the most commonly used measures of the difference of one distribution from another. If f(x) and g(x) are probability distributions, then KL(f, g) is the average, for observations x drawn from f(x), of log(f(x)/g(x)). KL(f, g) is 0 when f and g are the same distribution and is always greater than 0 if the two distributions differ. Technically, KL(f, g) is a divergence, not a distance, because KL(f, g) need not equal KL(g, f).
15 Other common divergences/distances between statistical distributions are the Pearson chi-squared distance, the Neyman chi-squared distance, the symmetric chi-squared distance, and the Hellinger distance [Linhart & Zucchini, 1986; Lindsay, 2004; Lindsay et al., 2007].
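Lele's observation is easy to see numerically. In the sketch below, which is entirely my own construction, the truth is N(0.3, 1) while the hypotheses are A: N(0.5, 1) and B: N(0, 1); the per-observation log likelihood ratio converges to KL(truth, B) − KL(truth, A), so the evidence favors the hypothesis closer to the truth even though neither hypothesis is true.

```python
# Sketch: (1/n) log LR(A:B) converges to KL(T,B) - KL(T,A) when the truth T is
# neither hypothesis. For unit-variance normals, KL(N(a,1), N(b,1)) = (a-b)^2/2.
import random

mu_t, mu_a, mu_b = 0.3, 0.5, 0.0

def kl(a, b):
    return (a - b) ** 2 / 2.0

random.seed(7)
n, log_lr = 200000, 0.0
for _ in range(n):
    x = random.gauss(mu_t, 1.0)
    log_lr += (mu_a - mu_b) * x - (mu_a ** 2 - mu_b ** 2) / 2.0

print("empirical (1/n) log LR:", round(log_lr / n, 4))
print("KL(T,B) - KL(T,A):     ", kl(mu_t, mu_b) - kl(mu_t, mu_a))  # 0.025
```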
3 THE PROBABILITY OF MISLEADING EVIDENCE AND INFERENCE RELIABILITY
Richard Royall's introduction of the concepts of the probability of misleading evidence (M) and the probability of weak evidence (W) constituted a major advance in evidential thinking. Misleading evidence is defined as strong evidence for a hypothesis that is not true. The probability of misleading evidence is denoted by M, or by M(n, k) to emphasize that it is a function of both the sample size and the threshold, k, for considering evidence as strong. The probability of weak evidence is the probability that an experiment will not produce strong evidence for either hypothesis relative to the other. When one has weak evidence, one cannot say that the experiment distinguishes between the two alternative hypotheses in any meaningful way. These probabilities link evidential statistics to the error statistical thread in classical frequentist analysis. As experimental design criteria, M and W are superior to the Type I (design based probability of rejecting a true null hypothesis, α) and Type II (design based probability of failing to detect a true alternative hypothesis, β) error rates of classical frequentist statistics, because both M and W can be simultaneously brought to zero by increasing the sample size [Royall, 1997; 2004; Blume, 2002].

For Royall and his adherents there are three quantities of evidential interest: 1) the strength of evidence (the likelihood ratio), 2) the probability of observing misleading evidence16 (M), and 3) the probability that the observed evidence is misleading.17 This last is not the same as M, and it requires prior probabilities for the two alternative hypotheses.18 Royall claims that M is irrelevant post data and is for design purposes only. In common scientific practice, all three measures have often been freighted onto the p-value. There is a vast number of papers discussing common misconceptions in the interpretation of the p-value (e.g., [Blume and Peipert, 2003; Goodman, 2008]). The strength of Royall's approach is that these three quantities are split apart and can be thought about independently.

16 Given two statistical hypotheses (H1 and H2), the probability of misleading evidence for H2 over H1 is M = P1([L(x; H2)/L(x; H1)] > k), where M is the probability of misleading evidence, P1(·) is the probability of the argument under hypothesis 1, and k is an a priori boundary demarcating the lower limit of strong evidence.
17 The probability that observed evidence for H2 is misleading = π(H1)/[π(H1) + π(H2) · LRob], where π(H1) is the prior probability of H1 and LRob is the observed likelihood ratio (see [Blume, 2002]).
18 We do not attach much importance to this third quantity; meaningful priors are rarely available. Its primary purpose in its presentation is to clarify that it is indeed distinct from M.
4 GLOBAL & LOCAL RELIABILITY
There is a deep reason why M and other flavors of error statistics are important in statistical approaches to scientific problems. We strongly believe that one of the foundations of effective epistemology is some form of reliabilism. Under reliabilism, 16 Given two statistical hypotheses (H and H ) the probability of misleading evidence for H 1 2 2 over H1 is M = P1 ([L(x; H2 )/L(x; H1 )] > k); where M is the probability of misleading evidence, P1 (.)is the probability of the argument under hypothesis 1, and k is an a priori boundary demarcating the lower limit of strong evidence. 17 The probability that observed evidence for H is misleading = π(H )/[π(H )+π(H )·LR ], 2 1 1 2 ob where π(H1 ) is the prior probability of H1 and LRob is the observed likelihood ratio (see [Blume, 2002]). 18 We do not attach much importance to this third quantity, meaningful priors are rarely available, its primary purpose in its presentation is to clarify that it is indeed distinct from M.
a belief (or inference) is justified if it is formed from a reliable process [Goldman, 1986; 2008; Roush, 2006]. Reliability has two flavors, global reliability and local reliability. Global reliability describes the truth-tracking or error avoidance behavior of an inference procedure over all of its potential applications. Examples of global reliability measures in statistics are Neyman/Pearson test sizes (α and β) and confidence interval levels. These measures describe the reliability of the procedures, not individual inferences. Individual inferences are deemed good if they are made with reliable procedures. For example, if α = 0.05 the scientist can feel comfortable saying: “Only in 5% of the cases would this procedure reject a null hypothesis in error, so I can have confidence in the rejection that I have currently observed.” Royall's probability of misleading evidence M is this global kind of measure, and hereafter we will refer to it as the global reliability of the design, or MG. Local reliability, on the other hand, is the “truth-acquisition or error avoidance in scenarios linked to the actual scenario in question” [Goldman, 2008]. Fisherian p-values and Mayo's test severity [Mayo, 2004; Mayo and Cox, 2006] are local reliability measures. It is easy to see that the p-value is a local reliability measure, because the error probabilities are calculated relative to the specific observed results. Both local and global reliability are useful in generating scientific knowledge [Goldman, 1986]. A local reliability measure or measures would be useful within the context of the evidential paradigm.
5 LOCAL RELIABILITY AND THE EVIDENTIAL PARADIGM

5.1 Local reliability under the alternatives
We define the local reliability of the evidence, ML, as the probability that evidence as strong as or stronger than the evidence actually observed for one model could have been generated under the alternative. As a post data measure, ML is not the same as MG, conceptually or quantitatively. ML is also distinct from the probability that the observed evidence is misleading, in several respects. First, ML involves a tail sum and the probability that the evidence is misleading does not; and second, the probability that the evidence is misleading depends on the prior probabilities of the two models, while ML does not.

Royall [1997] presents a surprising but powerful and simple result that sets bounds on the magnitude of MG. He shows that

PB( PA(X)/PB(X) ≥ q ) ≤ 1/q

where q is any constant.19 In particular, if q = k, the threshold for strong evidence, we see that the probability of misleading evidence, MG, must be less than 1/k.

19 One proof follows directly from substitution into a classic theorem in probability called Markov's inequality, which states that if Y is a nonnegative random variable then P(Y ≥ q) ≤ E(Y)/q, where E(·) denotes expectation. Substituting the likelihood ratio for Y, we have PB(PA(x)/PB(x) ≥ q) ≤ EB(PA(x)/PB(x))/q. By definition, EB(PA(x)/PB(x)) = ∫ PB(x)(PA(x)/PB(x)) dx. This last integral simplifies to ∫ PA(x) dx, which integrates to 1 because PA(x) is a probability distribution. Thus, PB(PA(x)/PB(x) ≥ q) ≤ 1/q, as claimed.
Royall calls this the universal bound on the probability of misleading evidence. The actual probability of misleading evidence may often be much lower. Further, as the likelihood ratio of observed strong evidence (LRob) is by definition greater than or equal to k, the local probability of misleading evidence, ML, must be less than or equal to the global probability of misleading evidence. That is:

ML = PB( PA(X)/PB(X) ≥ LRob ) ≤ 1/LRob ≤ 1/k, and ML ≤ MG ≤ 1/k.

One question that springs to mind is why a post data reliability measure was not included in Royall's original formulation of the evidential paradigm. While only Royall could really answer this question, should he choose to break his silence, it is easy to see that within Royall's context there is no need for an explicit measure of local reliability. Royall's work was focused on the comparison of simple, or point, models. In the comparison of simple models, the likelihood ratio and the p-value contain the same information, allowing one to transform from one to the other [Sellke et al., 2001], and ML is redundant. However, when one begins to expand the evidential approach, as one must to develop a complete statistical toolkit, ML does become an interesting and useful evidential quantity.
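Under an assumed normal model the local reliability has a closed form, which makes the chain of inequalities concrete. The sketch below is mine; it uses the same bump-function formula that appears in the Blume chapter earlier in this volume.

```python
# Sketch: M_L = P_B(LR >= LR_ob) for H_B: mu=0 vs H_A: mu=c (SD units) with n
# observations, compared against the 1/LR_ob and 1/k bounds.
from math import log, sqrt
from statistics import NormalDist

def prob_lr_at_least(q, c, n):
    cn = c * sqrt(n)
    return NormalDist().cdf(-log(q) / cn - cn / 2.0)

c, n, k = 0.5, 25, 8
for lr_ob in (8, 20, 50):          # observed strong evidence, so LR_ob >= k
    ml = prob_lr_at_least(lr_ob, c, n)
    print(f"LR_ob={lr_ob:3d}: M_L={ml:.4f} <= 1/LR_ob={1/lr_ob:.4f} "
          f"<= 1/k={1/k:.4f}")
```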
5.2 Local reliability under the unknown truth
The likelihood ratio, or any other measure of the strength of evidence, is a point estimate of the difference between the divergences from truth to A and from truth to B. The first issue we need to address is to quantify the distribution of the strength of evidence under hypothetical repetition of the experiment. In Royall's formulation, this is done under either hypothesis A or hypothesis B. But as we have noted, neither of these hypotheses need be the true distribution. Royall's formulation is useful for pre-data sample size determination or optimal design issues. The local reliability ML defined in the previous section is potentially a useful post data quantity, but it is still calculated under the explicit alternative hypotheses and not the underlying true distribution. Once the experiment is conducted or the observations made, a non-parametric estimate of the true distribution is accessible. One can conduct a non-parametric bootstrap20 to obtain an estimate of the distribution of the strength of the evidence under this true distribution. This can be used inferentially in several different ways.

20 The bootstrap is one of the profound statistical developments of the last quarter century. The unknown distribution of a statistic can be estimated non-parametrically by repeatedly resampling data sets from the observed data set and recalculating the statistic for each (see [Efron & Tibshirani, 1993]).
First, one can obtain a bootstrap based confidence interval for the likelihood ratio. This tells us, if the experiment were repeated infinitely often (under the true model), what the distribution of likelihood ratios would be. This information could be presented as intervals, as a curve that represents a post data measure of the reliability of the estimated strength of evidence, or transformed to a support curve [Davison and Hinkley, 1992; 1997; Sellke et al., 2001]. Both the upper and lower confidence limits are informative. The lower limit says “it is not likely that the true divergence is less than this”, while the upper limit says “it is not likely that the true divergence is greater than this.” A second alternative measure that can be calculated using a bootstrap is the proportion of times hypothesis A will be chosen over hypothesis B (the proportion of times LR > 1). This measure can be interpreted as a post data reliability of the model choice. Although the bootstrap quantities defined above involve tail-sums, they are quite distinct from either the global or local probabilities of misleading evidence discussed in the previous section. Whereas MG and ML are counterfactual (answering the question of how probable a mistaken evidential assessment as strong as the one observed would be if the correct model were the one not indicated by the evidence), the bootstrap tail-sum is a direct assessment of the reliability of the observed evidence. Furthermore, this assessment is made under the truth, and not under either hypothesized model.
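A toy version of this bootstrap procedure, with data and hypotheses of my own invention, shows how both quantities fall out of the resampling distribution.

```python
# Sketch: nonparametric bootstrap of the log likelihood ratio for
# H_A: N(1,1) vs H_B: N(0,1) on a stand-in observed data set.
import random

random.seed(3)
data = [random.gauss(0.8, 1.0) for _ in range(20)]     # "observed" data

def log_lr(xs, mu_a=1.0, mu_b=0.0):
    return sum((mu_a - mu_b) * x - (mu_a ** 2 - mu_b ** 2) / 2.0 for x in xs)

boot = sorted(log_lr([random.choice(data) for _ in data]) for _ in range(5000))
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
frac_a = sum(b > 0 for b in boot) / len(boot)          # proportion with LR > 1

print(f"95% percentile interval for log LR: ({lo:.2f}, {hi:.2f})")
print(f"bootstrap reliability of choosing A: {frac_a:.3f}")
```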
6 EVIDENCE AND COMPOSITE HYPOTHESES
The evidential approach has been criticized (e.g., [Mayo and Spanos, 2006]) as a toy approach because the LR cannot compare composite hypotheses.21 This criticism is simultaneously true, a straw man, a good thing, and false. It is true because one can only strictly rank composite hypotheses if every member of one set is greater than every member of the other [Royall, 1997; Blume, 2002; Forster and Sober, 2004]. But the statement is also a straw man because it implies that the evidential paradigm is not able to do the statistical and scientific work done using composite hypotheses, which is patently false. Classically, composite hypotheses are used to determine if a point null is statistically distinguishable from the best alternative, or to determine if the best supported alternative lies on a specified side of the point null. Royall [1997, chapter 6] gives a number of sophisticated examples of doing real scientific work using the tools of the support curve, the likelihood ratio, and the support interval. Further, the inability of the LR to compare composite hypotheses is a good thing because Royall is correct that composite hypotheses can lead to some ridiculous situations. Consider the comparison of hypotheses regarding the mean of a normal distribution with a known standard deviation of 2, as in [Mayo and Cox, 2006]: H0: µ ≤ 12 vs. H1: µ > 12. A µ of 15 and a µ of 10,000 are both in H1. But if 15 is the true mean, a model with µ = 0 (an element of H0) will describe data generated by the true model much better than will µ = 10,000 (an element of H1). This contradiction will require some awkward circumlocution by Neyman/Pearson adherents. Finally, the statement is false if, under the evidence function concept discussed above, we expand the evidential paradigm to include model selection using information criteria.

21 Composite hypotheses are hypotheses that subsume multiple hypotheses.
Comparing composite hypotheses using information criteria is discussed in more detail in the next section.
7 SELECTING BETWEEN COMPOSITE HYPOTHESES
We suggest that, under the evidential paradigm, the composite hypothesis problem be recast as a model selection problem among models with different numbers of free parameters. In the simple example given above, H0 is a model with no free parameters while H1 is a family of models indexed by the free parameter µ. Model selection using information criteria22 compares models by estimating from data their relative Kullback-Leibler distance to truth [Burnham and Anderson, 2002]. This is a reasonable evidence function. With multiple models, all models are compared to the model with the lowest estimated KL distance to truth. The model selection procedures are blind to whether the suite of candidate models is partitioned into composite hypotheses. One can consider the hypothesis that contains the best supported model to be the hypothesis best supported by the data. One is no longer comparing all points in one hypothesis to all points in another but, in effect, comparing the best to the best, where ‘best’ is defined as the model with the lowest information criterion value. This solution is neither ad hoc (to the particular case) nor post hoc (after the fact/data).

The comparison of composite hypotheses using information criteria is not a toy procedure, and can do real scientific work. Taper and Gogan [2002], in their study of the population dynamics of the Yellowstone Park northern elk herd, were interested in discerning whether population growth was density dependent or density independent. They fitted 15 population dynamic models to the data and selected amongst them using the Schwarz information criterion (SIC). The best model by this criterion was a density dependent population growth model, and the difference between the SIC value for this model and that of the best density independent model was more than 5, a very highly significant difference [Burnham and Anderson, 2002]. There were a number of statistically indistinguishable density dependent models that all fit the data well, making identifying the best model difficult. Nevertheless, it is clear that the best model is in the composite hypothesis of density dependence, not the composite hypothesis of density independence.
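A stripped-down version of this information-criterion comparison is sketched below. The data, models, and numbers are hypothetical stand-ins (a point null versus a composite alternative for a normal mean with known variance), not the elk herd analysis itself.

```python
# Sketch: comparing the composite H1 (mu free) to H0 (mu = 0, no free
# parameters) with the Schwarz information criterion,
# SIC = -2*max log-likelihood + (free parameters)*log(n); sigma = 1 known.
import random
from math import log, pi

random.seed(11)
data = [random.gauss(0.7, 1.0) for _ in range(30)]
n = len(data)

def neg2_loglik(mu):
    return sum((x - mu) ** 2 for x in data) + n * log(2.0 * pi)

sic_h0 = neg2_loglik(0.0)                       # zero free parameters
mu_hat = sum(data) / n                          # MLE under the composite H1
sic_h1 = neg2_loglik(mu_hat) + 1 * log(n)       # one free parameter
print(f"SIC(H0) = {sic_h0:.1f}, SIC(H1) = {sic_h1:.1f}, "
      f"delta = {sic_h0 - sic_h1:.1f}")
# The composite hypothesis containing the lower-SIC model is the one better
# supported; the text treats a delta above 5 as a very strong difference.
```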
8 EVIDENCE AND THE CHALLENGES OF MULTIPLICITIES
As pointed out by Donald Berry [2007], multiplicities are the bane of all statistics. By multiplicities we mean the vague class of problems that are not simple, including multiple hypotheses, multiple comparisons, multiple parameters, multiple tests, and multiple looks at the data.

22 Information criteria are a class of measures for estimating the relative KL distance of models to “truth”. In general, information criteria include both the number of parameters and the number of data points in their calculation. Information criteria attempt (with varying degrees of success) to overcome the problems of overfitting that would result if comparisons were made on the basis of likelihoods alone. Technically, only order consistent information criteria meet all the criteria of evidence functions.
Evidential statistics is not immune to the effects of multiplicities, but the evidential paradigm does have approaches to tackling these problems, which are in some cases superior to classical approaches.
8.1 Nuisance parameters
Nuisance parameters occur when reality and the data are complex enough to require models with multiple parameters, but inferential interest is confined to a reduced set of parameters. Making inferences about the parameters of interest that are not colored by the nuisance parameters is difficult. Marginal or conditional likelihoods can be used. These are proper likelihoods,23 so all the likelihood ratio based evidential techniques can be employed. Unfortunately, marginal and conditional likelihoods are not always obtainable. Royall [2000] recommends the use of the profile likelihood24 ratio as a general solution. Royall feels that the profile likelihood ratio is an ad hoc solution because true likelihoods are not being compared. Nevertheless, he finds the performance of the profile likelihood ratio to be very satisfactory. In our expanded view of evidence, the profile likelihood ratio is not ad hoc, because the profile likelihood ratio can be shown to be an evidence function. Royall [2000] shows that the probability of misleading evidence from a profile likelihood ratio is not constrained by the universal bound, and can exceed 1/k. Thus, even in this first expansion of the concept of evidence beyond the likelihood ratio of two simple hypotheses, we see that ML is decoupled from the likelihood ratio and contains distinct information.

23 A proper likelihood is directly associated with and numerically equal to some probability distribution function for the observations. Proper likelihoods are often referred to in the literature as “true likelihoods.”
24 Profile likelihoods are functions of the data and the parameter or parameters of interest (i.e., not the nuisance parameters). The value of the profile likelihood is the maximum value the full likelihood could take over all possible values of the nuisance parameters.
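As a concrete illustration (mine, not Royall's), consider the profile likelihood for a normal mean with unknown variance: it maximizes the full likelihood over σ at each fixed µ, and the profile likelihood ratio then compares two values of µ directly.

```python
# Sketch: profile log-likelihood for mu, with sigma as the nuisance parameter.
# For fixed mu, the maximizing sigma^2 is sum((x - mu)^2)/n, which gives
# l_p(mu) = -(n/2) * (log(sigma_hat^2(mu)) + 1), up to an additive constant.
import random
from math import log

random.seed(5)
data = [random.gauss(1.0, 2.0) for _ in range(25)]
n = len(data)

def profile_loglik(mu):
    s2_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (log(s2_hat) + 1.0)

log_plr = profile_loglik(1.0) - profile_loglik(0.0)
print(f"profile log LR for mu=1 vs mu=0: {log_plr:.2f}")
```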
8.2 Sequential analyses — multiple tests of the same hypothesis
Another multiplicity that the evidential approach handles nicely is sequential analysis. Multiple looks at the data while they are accumulating do not diminish the strength of evidence of the ultimate likelihood ratio, unlike p-value based approaches, which must carefully control the spending of test size in multiple analyses [Demets and Lan, 1994]. Further, the universal bound MG ≤ 1/k is still maintained. This subject is developed in detail by Blume [2008]. Under a sequential sampling design, observations will be terminated as soon as the likelihood ratio passes the threshold k. Consequently, the local probability of misleading evidence will be only slightly lower than the global probability of misleading evidence.

23 A proper likelihood is directly associated with and numerically equal to some probability distribution function for the observations. Proper likelihoods are often referred to in the literature as “true likelihoods.”
24 Profile likelihoods are functions of the data and the parameter or parameters of interest (i.e. not the nuisance parameters). The value of the profile likelihood is the maximum value the full likelihood could take over all possible values of the nuisance parameters.
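A small simulation can illustrate the point that the universal bound survives continuous monitoring. This is our own illustrative setup (two simple normal means, arbitrary constants), not Blume's.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_max, n_trials = 8.0, 200, 20000
misleading = 0

for _ in range(n_trials):
    x = rng.normal(0.0, 1.0, n_max)        # data generated under mu = 0
    # Running log likelihood ratio for mu = 1 versus the true mu = 0:
    # log f(x; 1) - log f(x; 0) = x - 1/2 for each observation.
    loglr = np.cumsum(x - 0.5)
    # Look at the data after every observation; stop at strong evidence.
    if np.any(loglr >= np.log(k)):
        misleading += 1

print(f"P(misleading evidence under continuous monitoring) ≈ "
      f"{misleading / n_trials:.4f}  (universal bound 1/k = {1 / k:.4f})")
```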
8.3 Multiple comparisons: many tests of different hypotheses
Multiple comparisons place a heavy burden on scientists. Scientists are bursting with questions, and design experiments and surveys to answer many questions simultaneously. As a classical error statistical analysis sets an a priori level of type I error on each test, increasing the number of tests increases the probability that at least one of them will be significant by chance alone. To control the family-wide error rate, scientists have been forced to decrease the size of individual tests by using lower type I error rates. The price of this move is that the power of the individual tests to detect small but real differences is diminished. The scientist makes fewer errors, but gets fewer things right as well. An evidential analysis is not immune to the increase in the family-wide probability of error with an increasing number of tests. If we define MG(n, N, k) as the pre-experiment probability of at least one misleading result at level k amongst N comparisons with ni observations each, then

$$M_G(n, N, k) = 1 - \prod_{i=1}^{N}\left(1 - M_G(n_i, k)\right) \le \sum_{i=1}^{N} M_G(n_i, k) \le N/k.$$
So the global probability of misleading evidence increases with the number of comparisons in the same fashion that the family-wide type I error does. As the local probability of misleading evidence for a comparison is always less than or equal to the global probability of misleading evidence for that comparison, the local family-wide probability of misleading evidence will also be less than the global family-wide probability of misleading evidence. Although multiple comparisons lay a burden on evidential analysis similar to that laid on a classical error statistical analysis, the evidential approach has more flexible ways of mitigating this burden. The ways family-wide misleading evidence can be controlled depend on whether sample size is constrained or can be increased, either before or after the initial experiment is conducted. If the sample sizes in the comparisons are fixed, then the only control strategy is to increase the strong evidence threshold k, in direct analogy to the test size adjustment of classical multiple comparisons. This will decrease MG(n, N, k), but at the cost that the probability of weak evidence (W) will increase for all comparisons, similar to the classical decrease in power resulting from test size adjustment. However, if sample size is flexible, then several alternative strategies become available. Strug and Hodge [2006] give a clear description of three scenarios for controlling the global family-wide misleading evidence in multiple comparisons by adjusting sample size. Strategy 1: Increase sample size in all comparisons before the experiment. MG(ni, k) can be brought to any desired level without changing the strong evidence threshold k for each comparison by increasing sample size. Consequently, MG(n, N, k) can also be brought to any desired level. This strategy has the advantage that W will be simultaneously decreased for all comparisons, but the cost in terms of increased sample size is high. Strategy 2: Increase sample size for only those comparisons exhibiting strong evidence. This requires an analysis of interim data, but we have seen that interim analysis has little untoward influence in an evidential analysis. MG(n, N, k) can be brought to any desired level, but W will remain unaltered. The sample size cost is less than that of Strategy 1. Finally, in Strategy 3, the scientist would increase sample size for comparisons with interim strong or weak evidence, but not strong opposing evidence. MG(n, N, k) is controllable at any desired level, W is reduced, and sample size costs are intermediate between Strategies 1 and 2.
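For the fixed-sample-size case, the bound above already dictates how the strong evidence threshold must scale with the number of comparisons. A trivial numerical sketch, with an illustrative target level of our choosing:

```python
def k_for_family_bound(n_comparisons, alpha):
    # From the bound M_G(n, N, k) <= N/k, choosing k >= N/alpha keeps the
    # family-wide probability of misleading evidence at or below alpha.
    return n_comparisons / alpha

for N in (1, 10, 50):
    k = k_for_family_bound(N, alpha=0.05)
    print(f"N = {N:3d} comparisons: choose k >= {k:6.0f} "
          f"(family-wide bound N/k = {N / k:.3f})")
```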
8.4 Multiple candidate models
One of the problems of multiplicities to which the evidential paradigm is most susceptible is the difficulty caused by too many candidate models. One of the great strengths of the evidential paradigm is that it allows and encourages the comparison of multiple models. This allows a more nuanced and accelerated investigation of nature. However, the risk is that, if too many models are considered with a single data set, a model that is not really very good will be favored by chance alone. This has been called model selection bias [Zucchini, 2000; Taper, 2004]. The problem of model selection bias has led Burnham and Anderson and their acolytes to argue strongly and repeatedly against “data dredging” and for compact candidate model sets defined by a priori scientific theory (e.g. [Anderson et al., 2000; Anderson and Burnham, 2002; Burnham and Anderson, 2002]). There is considerable merit to these recommendations, but the cost is that the ability to broadly explore model space is reduced. As with multiple comparisons, several alternatives are possible for an evidential analysis, each with costs and benefits. One suggestion, made by Taper and Lele [2004], is to increase k, the threshold for strong evidence. This would decrease the probability of misleading evidence over the entire analysis, but at the cost of potentially ending with a large number of indistinguishable models. Another suggested alternative [Bai et al., 1999; Taper, 2004] is to increase the parameter penalty in coordination with the increase in the number of candidate models. If effects are tapered, these procedures select models with large effects for each parameter. Here the ability to detect model fine structure is traded for the ability to investigate a broad number of models.
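Model selection bias is easy to reproduce numerically. In the following sketch, entirely our own construction, the response is pure noise and every candidate predictor is irrelevant, yet the best score among the candidates improves steadily as the candidate set grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.normal(size=n)  # response that is, in truth, pure noise

def aic_of_regression(y, x):
    # AIC (smaller-is-better form) for a one-predictor linear regression.
    coef = np.polyfit(x, y, 1)
    resid = y - np.polyval(coef, x)
    s2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1.0)
    return -2.0 * loglik + 2.0 * 3  # 3 parameters: slope, intercept, s2

for n_models in (1, 10, 100):
    # Candidate "models" are regressions on irrelevant random predictors.
    best = min(aic_of_regression(y, rng.normal(size=n))
               for _ in range(n_models))
    print(f"best AIC among {n_models:3d} junk models: {best:8.2f}")
# The best score improves as more candidates are screened, even though
# every candidate is irrelevant: the signature of model selection bias.
```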
9 DISCUSSION
Evidentialism is an adolescent statistical paradigm, neither infantile nor mature. It is capable of much scientific work, but with considerable scope for technical improvement. Strict Royallist evidentialism is rapidly gaining adherents in epidemiology and medical genetics, while information criteria based inference is a major force in ecological statistics. The elevation of evidentialism to a practical statistical approach is due to Royall's introduction of the concepts of weak and strong evidence and of misleading evidence. The introduction in this paper of local reliability (post data) serves to clarify the epistemic standing of evidential practices. The warrant for post data inference using the likelihood ratio as the strength of evidence is the local reliability of the evidence. The reliability of the evidence is a function of the local probability of misleading evidence, ML, which is directly linked to the LR. One interesting observation is that local reliability is in general much greater than NP error rates indicate. Further, local reliability is in general greater than the global probability of misleading evidence, MG (the a priori evidential error rate), indicates. We have also suggested several measures of local reliability that do not develop their probability assessments from the explicit alternative models under consideration but instead use a non-parametric bootstrap to estimate local reliability under the unknown true distribution. All of these are valid assessments of post data error probabilities. Which of them proves most helpful in constructing scientific arguments will become clear through time and use. Together with ML, these bootstrap methods provide the evidential approach with a rich suite of tools for post-data assessment of error probabilities that are uncoupled from the estimation of the strength of evidence.

Science needs mechanisms for the accumulation of sound conclusions (sensu [Tukey, 1960]). A major rival to Evidentialism as a philosophically sound (in our eyes) system for the advancement of science is the “Error Statistical” brand of Neyman-Pearson analysis promoted by Deborah Mayo. We dismiss Bayesianism for its use of subjective priors and a probability concept that conceives of probability as a measure of personal belief. Bayesianism is held by many philosophers to be the most appropriate method of developing personal knowledge. This may be, but it is irrelevant to the task at hand. Science depends on a public epistemology, not a private one. The Bayesian attempts to bridge the gap between the private and the public have been tortured. It is not that we believe that Bayes' rule or Bayesian mathematics is flawed, but that, from its axiomatic foundational definition of probability, Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe; what we are interested in is what you can show. Bayesian techniques that successfully eliminate the influence of the subjective prior [Boik, 2004], such as the Bayesian Information Criterion [Schwarz, 1978] or data cloning [Lele et al., 2007; Ponciano et al., 2009], may be useful.

Our difficulties with Mayo's Error Statistical approach are didactic. Mayo speaks of probing a hypothesis. In our radically fallibilist view of science, models or hypotheses can only be supported relative to other models or hypotheses. Certainly, there actually is a cryptic alternative hypothesis in Mayo's calculations, but we believe that the linguistic suppression of the alternative is counterproductive. A probed hypothesis is smugly self-congratulatory, whereas a pair of hypotheses compared evidentially invites scientists to throw new hypotheses into the mix. Fundamentally, Mayo's approach represents a fusion of Fisher's significance test, a post data error calculation, with Neyman-Pearson hypothesis testing. We think that the use of post data error as the strength of evidence and the shift in emphasis from inductive decision to inductive inference are both helpful steps. However, scientists have struggled with the logics of both Fisherian significance tests and
Neyman-Pearson tests. For generations, it has been a cottage industry for statisticians to write white papers trying to explain these concepts to working scientists. Nothing in Mayo's reformulation will ease these difficulties. On the other hand, the evidential paradigm presents scientists with powerful tools to design and analyze experiments and to present results with great clarity and simplicity. This is because the evidential paradigm is designed around the single task that scientists most need to do: objectively comparing the support for alternative models. Master statisticians can, with their decades of training in classical statistics, successfully navigate the conceptual pitfalls of Mayo's recasting of the Fisher and Neyman-Pearson methods, but for working scientists such as us, the evidential paradigm should be a great relief.

ACKNOWLEDGEMENTS

We thank the editors Malcolm R. Forster and Prasanta S. Bandyopadhyay for critical comments that have greatly improved this work.

BIBLIOGRAPHY

[Anderson and Burnham, 2002] D. R. Anderson and K. P. Burnham. Avoiding pitfalls when using information-theoretic methods. Journal of Wildlife Management 66:912-918, 2002.
[Anderson et al., 2000] D. R. Anderson, K. P. Burnham, and W. L. Thompson. Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management 64:912-923, 2000.
[Bai et al., 1999] Z. D. Bai, C. R. Rao, and Y. Wu. Model selection with data-oriented penalty. Journal of Statistical Planning and Inference 77:103-117, 1999.
[Barnard, 1949] G. A. Barnard. Statistical inference. Journal of the Royal Statistical Society, Series B 11:115-149, 1949.
[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. 2nd edition. Springer-Verlag, New York, 1985.
[Berger and Wolpert, 1988] J. O. Berger and R. L. Wolpert. The Likelihood Principle. 2nd edition. Springer-Verlag, New York, 1988.
[Berry, 2007] D. A. Berry. The difficult and ubiquitous problems of multiplicities. Pharmaceutical Statistics 6:155-160, 2007.
[Blume, 2002] J. D. Blume. Likelihood methods for measuring statistical evidence. Statistics in Medicine 21:2563-2599, 2002.
[Blume and Peipert, 2003] J. Blume and J. F. Peipert. What your statistician never told you about P-values. Journal of the American Association of Gynecologic Laparoscopists 10:439-444, 2003.
[Blume, 2008] J. D. Blume. How often likelihood ratios are misleading in sequential trials. Communications in Statistics-Theory and Methods 37:1193-1206, 2008.
[Boik, 2004] R. J. Boik. Commentary on Why Likelihood? by Forster and Sober. Pages 167-180 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Box, 1979] G. E. P. Box. Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson (eds.), Robustness in Statistics. New York: Academic Press, 201-236, 1979.
[Burnham and Anderson, 2002] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd edition. Springer-Verlag, New York, 2002.
[Cartwright, 1999] N. Cartwright. The Dappled World: A Study of the Boundaries of Science. Cambridge University Press, Cambridge, 1999.
[Cox, 2004] D. R. Cox. Commentary on The Likelihood Paradigm for Statistical Evidence by R. Royall. Pages 119-152 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Davison and Hinkley, 1997] A. C. Davison and D. V. Hinkley. Bootstrap Methods and their Application. Cambridge University Press, Cambridge, UK, 1997.
[Davison et al., 1992] A. C. Davison, D. V. Hinkley, and B. J. Worton. Bootstrap likelihoods. Biometrika 79:113-130, 1992.
[Demets and Lan, 1994] D. L. Demets and K. K. G. Lan. Interim analysis: the alpha-spending function approach. Statistics in Medicine 13:1341-1352, 1994.
[Edwards, 1992] A. W. F. Edwards. Likelihood. Expanded edition. Johns Hopkins University Press, Baltimore, 1992.
[Efron and Tibshirani, 1993] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London, UK, 1993.
[Fisher, 1912] R. A. Fisher. On an absolute criterion for fitting frequency curves. Messenger of Mathematics 41:155-160, 1912.
[Fisher, 1921] R. A. Fisher. On the “probable error” of a coefficient of correlation deduced from a small sample. Metron 1:3-32, 1921.
[Fisher, 1922] R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222:309-368, 1922.
[Fisher, 1973] R. A. Fisher. Statistical Methods and Scientific Inference. 3rd edition. Hafner, New York, 1973.
[Forster, 2002] M. R. Forster. Predictive accuracy as an achievable goal of science. Philosophy of Science 69(3):S124-S134, 2002.
[Forster, 2006] M. R. Forster. Counterexamples to a likelihood theory of evidence. Minds and Machines 16:319-338, 2006.
[Forster and Sober, 2004] M. Forster and E. Sober. Why Likelihood? Pages 153-190 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Fraser, 1963] D. A. S. Fraser. On the sufficiency and likelihood principles. Journal of the American Statistical Association 58:641-647, 1963.
[Frigg, 2006] R. Frigg. Scientific representation and the semantic view of theories. Theoria 55:49-65, 2006.
[Gemes, 2007] K. Gemes. Verisimilitude and content. Synthese 154(2):293-306, 2007.
[Giere, 1988] R. Giere. Explaining Science. University of Chicago Press, Chicago, 1988.
[Giere, 1999] R. N. Giere. Science without Laws (Science and Its Conceptual Foundations). Chicago: University of Chicago Press, 1999.
[Giere, 2004] R. N. Giere. How models are used to represent reality. Philosophy of Science 71(5):742-752, 2004.
[Giere, 2008] R. N. Giere. Models, metaphysics, and methodology. In Stephan Hartmann, Luc Bovens and Carl Hoefer (eds.), Nancy Cartwright's Philosophy of Science. Routledge, 2008.
[Goldman, 1986] A. I. Goldman. Epistemology and Cognition. Harvard University Press, Cambridge, MA, 1986.
[Goldman, 2008] A. I. Goldman. Reliabilism. In E. N. Zalta, editor. The Stanford Encyclopedia of Philosophy (Fall 2008 Edition), http://plato.stanford.edu/archives/fall2008/entries/reliabilism/. The Center for the Study of Language and Information (CSLI), Stanford University, Stanford, 2008.
[Goodman, 2008] S. N. Goodman. A dirty dozen: Twelve P-value misconceptions. Seminars in Hematology 45:135-140, 2008.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference. Cambridge University Press, Cambridge, 1965.
[Harris, 1974] J. Harris. Popper's definition of ‘Verisimilitude’. The British Journal for the Philosophy of Science 25:160-166, 1974.
[Hughes, 1997] R. I. G. Hughes. Models and representation. Philosophy of Science (Proceedings) 64:325-336, 1997.
[Jeffreys, 1961] H. Jeffreys. Theory of Probability. 3rd edition. The Clarendon Press, Oxford, 1961.
[Lele, 2004] S. R. Lele. Evidence functions and the optimality of the Law of Likelihood. In M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Lele et al., 2007] S. R. Lele, B. Dennis, and F. Lutscher. Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecology Letters 10:551-563, 2007.
[Lindsay, 2004] B. G. Lindsay. Statistical distances as loss functions in assessing model adequacy. Pages 439-488 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Lindsay et al., 2007] B. G. Lindsay, M. Markatou, S. Ray, K. Yang, and S. Chen. Quadratic distances on probabilities: A unified foundation. Columbia University Biostatistics Technical Report Series 9, 2007.
[Linhart and Zucchini, 1986] H. Linhart and W. Zucchini. Model Selection. John Wiley & Sons, 1986.
[Mayo, 1996] D. G. Mayo. Error and the Growth of Experimental Knowledge. University of Chicago Press, Chicago, 1996.
[Mayo, 2004] D. G. Mayo. An error-statistical philosophy of evidence. Pages 79-118 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Mayo and Cox, 2006] D. G. Mayo and D. R. Cox. Frequentist statistics as a theory of inductive inference. Pages 77-97 in Optimality: The 2nd Lehmann Symposium. Institute of Mathematical Statistics, Rice University, 2006.
[Mayo and Spanos, 2006] D. G. Mayo and A. Spanos. Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science 57:323-357, 2006.
[McCullagh and Nelder, 1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. 2nd edition. Chapman and Hall, London, 1989.
[Miller, 1972] D. Miller. The truth-likeness of truthlikeness. Analysis 33:50-55, 1972.
[Miller, 1974a] D. Miller. On the comparison of false theories by their bases. The British Journal for the Philosophy of Science 25:178-188, 1974.
[Miller, 1974b] D. Miller. Popper's qualitative theory of verisimilitude. The British Journal for the Philosophy of Science 25:166-177, 1974.
[Miller, 2000] D. W. Miller. Sokal & Bricmont: Back to the frying pan. Pli 9:156-173, 2000.
[Miller, 2006] D. W. Miller. Out of Error: Further Essays on Critical Rationalism. Ashgate, Aldershot, 2006.
[Morgan, 1999] M. Morgan. Learning from models. In M. Morrison and M. Morgan (eds.), Models as Mediators: Perspectives on Natural and Social Science. Cambridge: Cambridge University Press, 347-388, 1999.
[Neyman and Pearson, 1933] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 231:289-337, 1933.
[Niiniluoto, 1998] I. Niiniluoto. Verisimilitude: The third period. British Journal for the Philosophy of Science 49:1-29, 1998.
[Nisbet, 1980] R. Nisbet. History of the Idea of Progress. Heinemann, London, 1980.
[Oddie, 2007] G. Oddie. Truthlikeness. The Stanford Encyclopedia of Philosophy (Fall 2008 Edition), http://plato.stanford.edu/archives/fall2008/entries/truthlikeness/.
[Pickett et al., 1994] S. T. A. Pickett, J. Kolasa, and C. G. Jones. Ecological Understanding: The Nature of Theory and The Theory of Nature. Academic Press, San Diego, 1994.
[Platt, 1964] J. R. Platt. Strong inference. Science 146:347-353, 1964.
[Ponciano et al., 2009] J. M. Ponciano, M. L. Taper, B. Dennis, and S. R. Lele. Inference for hierarchical models in ecology: Confidence intervals, hypothesis testing, and model selection using data cloning. Ecology 90:356-362, 2009.
[Popper, 1935] K. Popper. Logik der Forschung. Vienna: Julius Springer Verlag, 1935.
[Popper, 1959] K. Popper. The Logic of Scientific Discovery. London: Hutchinson, 1959. (Translation of Logik der Forschung.)
[Popper, 1963] K. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge. London: Routledge, 1963.
[Popper, 1976] K. Popper. A note on verisimilitude. The British Journal for the Philosophy of Science 27(2):147-159, 1976.
[Quine, 1951] W. V. O. Quine. Two dogmas of empiricism. The Philosophical Review 60:20-43, 1951.
[Roush, 2006] S. Roush. Tracking Truth. Oxford University Press, Oxford, 2006.
[Royall, 1986] R. M. Royall. The effect of sample-size on the meaning of significance tests. American Statistician 40:313-315, 1986.
[Royall, 1992] R. M. Royall. The elusive concept of statistical evidence. Pages 405-418 in J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors. Bayesian Statistics 4. Oxford University Press, Oxford, 1992.
[Royall, 1997] R. M. Royall. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London, 1997.
[Royall, 2000a] R. M. Royall. On the probability of observing misleading statistical evidence. Journal of the American Statistical Association 95:760-780, 2000.
[Royall, 2000b] R. M. Royall. On the probability of observing misleading statistical evidence — Rejoinder. Journal of the American Statistical Association 95:773-780, 2000.
[Royall, 2004] R. M. Royall. The Likelihood Paradigm for statistical evidence. Pages 119-152 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Savage, 1976] L. J. Savage. On rereading R. A. Fisher (with discussion). Annals of Statistics 4:441-500, 1976.
[Schwarz, 1978] G. Schwarz. Estimating the dimension of a model. Annals of Statistics 6:461-464, 1978.
[Sellke et al., 2001] T. Sellke, M. J. Bayarri, and J. O. Berger. Calibration of p values for testing precise null hypotheses. American Statistician 55:62-71, 2001.
[Strug and Hodge, 2006a] L. J. Strug and S. E. Hodge. An alternative foundation for the planning and evaluation of linkage analysis I. Decoupling ‘error probabilities’ from ‘measures of evidence’. Human Heredity 61:166-188, 2006.
[Strug and Hodge, 2006b] L. J. Strug and S. E. Hodge. An alternative foundation for the planning and evaluation of linkage analysis II. Implications for multiple test adjustments. Human Heredity 61:200-209, 2006.
[Taper, 2004] M. L. Taper. Model identification from many candidates. Pages 448-524 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Taper and Gogan, 2002] M. L. Taper and P. J. P. Gogan. The Northern Yellowstone elk: Density dependence and climatic conditions. Journal of Wildlife Management 66:106-122, 2002.
[Taper and Lele, 2004] M. L. Taper and S. R. Lele. The nature of scientific evidence: A forward-looking synthesis. Pages 527-551 in M. L. Taper and S. R. Lele, editors. The Nature of Scientific Evidence: Statistical, Philosophical and Empirical Considerations. University of Chicago Press, Chicago, 2004.
[Thompson, 2007] B. Thompson. The Nature of Statistical Evidence. New York: Springer, 2007.
[Tichý, 1974] P. Tichý. On Popper's definitions of verisimilitude. The British Journal for the Philosophy of Science 25:155-160, 1974.
[Tukey, 1960] J. W. Tukey. Conclusions vs decisions. Technometrics 2:423-433, 1960.
[van Fraassen, 1980] B. van Fraassen. The Scientific Image. Oxford University Press, Oxford, 1980.
[van Fraassen, 2002] B. van Fraassen. The Empirical Stance. Yale University Press, New Haven and London, 2002.
[Zucchini, 2000] W. Zucchini. An introduction to model selection. Journal of Mathematical Psychology 44:41-61, 2000.
[Zwart, 2001] S. D. Zwart. Refined Verisimilitude. Springer, 2001.
Akaikean Paradigm
AIC SCORES AS EVIDENCE: A BAYESIAN INTERPRETATION

Malcolm Forster and Elliott Sober
1 INTRODUCTION
Akaike [1973] helped to launch the field in statistics now known as model selection theory by describing a goal, proposing a criterion, and proving a theorem. The goal is to figure out how accurately models will predict new data when fitted to old. The criterion came to be called AIC, the Akaike Information Criterion:1

AIC(M) = log-likelihood of L(M) − k,

where the model M contains k adjustable parameters and L(M) is the member of M obtained by assigning to the adjustable parameters in M their maximum likelihood values. Akaike's theorem is that AIC is an unbiased estimator of predictive accuracy, given some assumptions that are widely applicable.2 The three parts of this achievement should be kept separate. Not conflating the goal and the criterion is important, since criteria other than AIC might do a better job in some circumstances in achieving the goal. The criterion and the theorem also need to be distinguished, since, in general, the fact that an estimator is unbiased does not suffice to show that it should be used.

The theorem that Akaike proved made it natural to understand AIC as a frequentist construct. AIC is a device for estimating the predictive accuracy of models just as a kitchen scale is a device for estimating the weight of objects. Bayesians assess an estimator by determining whether the estimates it generates are probably true or probably close to the truth. Evaluating how probable it is that a melon weighs two pounds, given that the scale says that it weighs two pounds, requires that one have a prior probability for the melon's weighing two pounds. Akaike's theorem says nothing about prior or posterior probabilities, so there was no reason to think of AIC in Bayesian terms. Rather, what Akaike did was what
frequentists generally do when they assess an estimator. They establish one or more of its “long-run operating characteristics.” The “long-run” average of an unbiased estimator of a quantity is, by definition, centered on the quantity's true value. If you repeatedly weigh an object on an unbiased scale, the average of all the readings of the scale converges to the object's true weight. On any given occasion, the scale's reading may be too high or too low. To say that an estimator is unbiased leaves open what its variance is. This fact about Akaike's theorem is sometimes taken to cast doubt on AIC. But here we must be careful — there may be more to AIC than Akaike's theorem established. In fact, AIC isn't just an unbiased estimator. The expected squared error of an unbiased estimator is strictly less than that of any other estimator that differs from it by a constant.3 AIC is in this respect a better estimator of predictive accuracy than BIC, the Bayesian Information Criterion first derived by Schwarz [1978]. BIC is both a biased estimator of predictive accuracy and has a higher expected squared error. To be sure, BIC was not developed as a method for estimating predictive accuracy, but rather was formulated as a method for estimating a model's average likelihood. Nonetheless, the point is that there is more to be said on behalf of AIC than that it is unbiased. This does not prove that AIC is the best of all possible estimators of predictive accuracy. However, we suggest that estimators should be evaluated in the same way that empirical scientific theories are. Rather than asking whether general relativity is the best of all possible theories, we should ask whether it is better than the alternative theories that have been identified. The same goes for AIC. The goal of this paper is not to provide a full-scale assessment of AIC, but to show that it is an estimator whose estimates should be taken seriously by Bayesians, its frequentist pedigree notwithstanding.4

Frequentists often maintain that the question of how an individual estimate should be interpreted is meaningless — that the only legitimate question concerns the long-term behavior of estimators. Bayesians reply that both questions are important and that the interpretation of individual estimates is pressing in view of the fact that a given estimate might be produced by any number of different estimation procedures (see, for example, [Howson and Urbach, 1993]). We agree with Bayesians on this point and so we will pose a question about individual estimates that frequentists typically disdain — do AIC scores provide evidence concerning how predictively accurate a model will be?

1 Notice that AIC scores are negative numbers, with smaller likelihoods being “more negative” (i.e., farther from zero). Thus, here we adopt the convention used in [Forster and Sober, 1994] wherein higher AIC scores (those closer to zero) are better — they indicate a higher degree of predictive accuracy. In the statistics literature the AIC score is usually represented as −2 times the AIC formula we use here, so the opposite convention is followed: higher AIC scores are worse, because these scores indicate that the fitted model is more distant from the truth.
2 See [Sakamoto et al., 1986] for a thorough explanation and proof of Akaike's theorem.
3 Proof: Consider an estimator that differs from AIC by a constant c, and denote the true predictive accuracy a model has by AIC*. AIC is an unbiased estimate of AIC* if and only if E[AIC] = AIC*, where E is “expected value.” We may now calculate the expected squared error of the new estimator using well-known properties of expected values: E[(AIC + c − AIC*)²] = E[(AIC − AIC*)²] + c². The “cross-term” is zero because AIC is unbiased. Therefore, the expected squared error of AIC is strictly less than the expected squared error of any estimator that differs from it by a non-zero constant. This includes BIC and any other estimator that is equal to the log-likelihood plus some penalty for complexity.
4 By “frequentists” we are referring to the classical school of statisticians led primarily by Neyman and Pearson, and Fisher.

2 ESTIMATES AS EVIDENCE
When Bayesians seek to explain why a positive pregnancy test is evidence that the person taking the test is pregnant, they inevitably refer to the Law of Likelihood [Hacking, 1965]:

Observation O favors hypothesis H1 over hypothesis H2 if and only if P(O|H1) > P(O|H2).

This concept of “favoring” ignores the values of prior probabilities and focuses exclusively on the evidential meaning of the present observations, which is said to be captured by the likelihoods of the competing hypotheses. The odds version of Bayes's theorem makes it clear that, for Bayesians, likelihoods are the sole vehicle by which the present observations can change one's degree of belief in the hypotheses:

$$\frac{P(H_1 \mid O)}{P(H_2 \mid O)} = \frac{P(O \mid H_1)}{P(O \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

The ratio of posterior probabilities differs from the ratio of priors precisely when the likelihood ratio differs from unity. And the more the likelihood ratio differs from unity, the greater the transformation that the observations achieve. The Law of Likelihood applies to point estimates of a continuous parameter just as much as it does to the presence or absence of a dichotomous trait like pregnancy. For example, if a thermometer, applied to object o, yields a reading R(o) = y, this is evidence that o's temperature, T(o), is y rather than x precisely when

(T1)  P[R(o) = y | T(o) = y] > P[R(o) = y | T(o) = x].
If AIC scores are like thermometer readings in this respect, an AIC score of y is evidence that the model has a predictive accuracy of y rather than x precisely when

(A1)  P[AIC(M) = y | PA(M) = y] > P[AIC(M) = y | PA(M) = x],
where PA(M) denotes the predictive accuracy of model M. Akaike's theorem, that AIC is an unbiased estimator of predictive accuracy, does not entail (A1) any more than a thermometer's being an unbiased estimator of temperature entails (T1). For suppose a thermometer works like this: when an object has a temperature of y, the thermometer has a 90% chance of reading (y − a) and a 10% chance of reading (y + b). The thermometer is an unbiased estimator of temperature precisely when y = (0.9)(y − a) + (0.1)(y + b),
which is true if and only if a/b = 1/9. Suppose, for example, that a = 10 and b = 90. Then an object whose true temperature is +3 has a 90% chance of having the thermometer say its temperature is −7 and a 10% chance of having the thermometer say its temperature is +93. Since the average (expected) reading is (0.9)(−7) + (0.1)(93) = 3, the thermometer is unbiased. Yet, if an object's temperature is 3, the probability is zero that the thermometer will say that its temperature is 3. If the thermometer produces a reading of x, the maximum likelihood estimate of the object's temperature is (x + 10). This unbiased thermometer has an asymmetric error distribution; its readings have a higher probability of being too low than too high. This is why (T1) is false. The following proposition is true instead:

(T2)  P[R(o) = y | T(o) = y + 10] = 90% > P[R(o) = y | T(o) = x], for all x, y such that x ≠ y + 10.
The thermometer's readings provide evidence about temperature because (T2) is true; the notion of evidence used here is the one sanctioned by the Law of Likelihood.
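The behavior of this hypothetical thermometer is easy to verify by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
true_temp = 3.0

# The asymmetric thermometer of the text: with probability 0.9 it reads
# 10 degrees low, with probability 0.1 it reads 90 degrees high.
readings = np.where(rng.random(100000) < 0.9,
                    true_temp - 10.0, true_temp + 90.0)

print(f"mean reading: {readings.mean():.2f} (unbiased: equals {true_temp})")
# The maximum likelihood estimate corrects any single reading upward:
print(f"ML estimate from a reading of -7: {-7 + 10}")
```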
3 DIFFERENCES IN AIC SCORES
The reason we have labored the point that an unbiased estimator can have an asymmetric error distribution, and that this does not prevent its estimates from providing evidence, is that AIC is both an unbiased estimator and has an asymmetric error distribution. The proposition (A1) is false, but AIC scores still provide evidence about the predictive accuracies of models. Consider two models, M1 and M2, where the first is nested in the second, meaning that M1 is a special case of M2. Suppose that the AIC score of the second model is y units larger than the AIC score of the first. It turns out that this difference in AIC scores does not obey the following principle:

(A2)  P[AIC(M2) − AIC(M1) = y | PA(M2) − PA(M1) = y] > P[AIC(M2) − AIC(M1) = y | PA(M2) − PA(M1) = x], for all x, y such that x ≠ y.

The falsity of (A2) is shown by the different curves in Figure 1, which represents two nested models that differ by one in the number of adjustable parameters they have. Each curve in Figure 1 pertains to a difference in AIC scores; differences in AIC scores are observations, akin to the observation that one object has a higher thermometer reading than another. For any one observation, different hypotheses about the difference in the two models' predictive accuracies confer different probabilities on that observation. Each curve in Figure 1 is in this sense a likelihood function. The derivation of these curves for the general case of a pair of nested
models that differ by k in their number of adjustable parameters is given in the Appendix.

[Figure 1 appears here: likelihood curves P(∆AIC = y | ∆PA = x), plotted against ∆PA, for several observed values of ∆AIC.]
Figure 1. The graphs show the likelihood function for various values of the statistic ∆AIC. The meta-model describes how the difference in predictive accuracy of two nested models (call them object models) will affect the difference in their AIC scores. The meta-model has one adjustable parameter, ∆PA, and each curve shows the likelihood of the hypothesis corresponding to the value of ∆PA read from the horizontal axis. In this case, ∆PA is the difference in predictive accuracy of two object models that differ in the number of adjustable parameters by only 1.

Notice in Figure 1 that the peak of the likelihood function for the observation that AIC(M2) − AIC(M1) = y does not correspond to the hypothesis that PA(M2) − PA(M1) = y. Rather, the maximum likelihood hypothesis is that PA(M2) − PA(M1) = x, for some number x that is greater than y. This is easiest to see in Figure 1 by looking at the case where ∆AIC = 1. This value is made more probable by ∆PA = 1.5 than it is by ∆PA = 1. The same point holds for the curves depicted in Figure 2, which again describes a pair of nested models, but this time one model has 10 more adjustable parameters than the other. The following generalization holds. If AIC(M2) − AIC(M1) > 0, then there exists a positive number x such that

(A3)  P[AIC(M2) − AIC(M1) = y | PA(M2) − PA(M1) = x] > P[AIC(M2) − AIC(M1) = y | PA(M2) − PA(M1) = z], for all z ≠ x.

AIC differences therefore favor some hypotheses about predictive accuracy over
others, assuming that the concept of evidential favoring is given by the Law of Likelihood.

[Figure 2 appears here: likelihood curves P(∆AIC = y | ∆PA = x), plotted against ∆PA, for several observed values of ∆AIC.]
Figure 2. The graphs show the likelihood function for various values of the statistic ∆AIC. As in Figure 1, the meta-model has one adjustable parameter, ∆PA, and each curve shows the probability of the statistic for various values of the parameter. In this case, ∆PA is the difference in predictive accuracy of two object models that differ in the number of adjustable parameters by 10.

There is a second interpretation of the results described in Figures 1 and 2 that Bayesians can extract. It is obtained by using Royall's [1997] concept of a likelihood interval. Royall supplements the Law of Likelihood by proposing a definition of what it means for an observation to provide strong evidence favoring H1 over H2; he suggests that this is true precisely when the likelihood ratio is at least 8. Royall uses this convention to draw an interval around the maximum likelihood point on a likelihood function that has the following property: the maximum likelihood value has a likelihood that is at least 8 times as large as that possessed by any point that is outside the interval. Interpreted in this way, some of the observations described in Figures 1 and 2 provide strong evidence favoring the maximum likelihood estimate (which is positive) over any negative estimate.5

5 In Figure 2, ∆AIC = −7 provides strong evidence that ∆PA is negative, although this does not happen in Figure 1.
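Curves like those in Figures 1 and 2 can be reproduced numerically from the distributional result derived in the Appendix, namely that ∆AIC is distributed as ½χ²(∆k, ∆k + 2∆PA) − ∆k. The following sketch is our own illustration, using scipy's noncentral chi-square to evaluate the likelihood of ∆PA for several observed values of ∆AIC:

```python
import numpy as np
from scipy.stats import ncx2

def lik_of_dpa(dAIC_obs, dPA, dk):
    # Appendix result: 2*(dAIC + dk) ~ noncentral chi-square with
    # df = dk and noncentrality dk + 2*dPA.  A change of variables gives
    # the density of dAIC at the observed value, as a function of dPA.
    nc = dk + 2.0 * dPA
    return 2.0 * ncx2.pdf(2.0 * (dAIC_obs + dk), df=dk, nc=nc)

dk = 1                                   # as in Figure 1
dPA_grid = np.linspace(-0.49, 25, 500)   # dPA must exceed -dk/2
for dAIC_obs in (0.0, 1.0, 5.0, 10.0):
    lik = lik_of_dpa(dAIC_obs, dPA_grid, dk)
    mle = dPA_grid[np.argmax(lik)]
    print(f"dAIC = {dAIC_obs:5.1f}: maximum likelihood dPA ≈ {mle:.2f}")
# As the text notes, the maximum likelihood value of dPA sits above the
# observed dAIC; for example, dAIC = 1 is better explained by dPA = 1.5
# than by dPA = 1.
```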
Again, this fails to be true when the difference in AIC scores is very small. A further assimilation of AIC scores to the Bayesian framework can be achieved by adding a prior distribution over differences in predictive accuracies. This, plus the likelihood function associated with the observed difference in AIC scores, allows one to calculate the posterior probability (density) that one model's predictive accuracy exceeds another's by x. One also can calculate the posterior probability that the difference in predictive accuracies is positive.

There is a common puzzle about Bayesianism that has been clarified here. Bayesians often talk about comparing the probabilities of nested models. But, in practice, they compare nested models in terms of the ratio of their likelihoods, the Bayes factor, as in [Schwarz, 1978], which may be viewed as an application of the Law of Likelihood. This makes sense because the Bayes factor can favor either model. Unfortunately, this comparison contradicts a more fundamental Bayesian comparison in terms of posterior probabilities if the first model entails the second, for then the probability of the first model can never be greater than that of the second. We have shown that the problem disappears as soon as Bayesians focus on the predictive accuracy of the models, rather than their truth. For now it is possible for the predictive accuracy of the first model to be higher than that of the second model, even if it is impossible for the first model to be true while the second model is false. Bayes factors are no longer involved in the comparison; now we are comparing meta-hypotheses about predictive accuracy. This shows why Bayesians need not be threatened, or puzzled, when they see scientists comparing models that are related by strict logical entailment.6
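As a sketch of this further step, the posterior can be computed on a grid using the same likelihood as in the previous sketch; the prior used here is purely illustrative and is not proposed in the text.

```python
import numpy as np
from scipy.stats import ncx2, expon

# Grid posterior for dPA given an observed dAIC.  The prior is a shifted
# exponential on [-dk/2, inf); any proper prior could be substituted.
dk, dAIC_obs = 1, 1.0
dPA = np.linspace(-0.499, 30, 2000)
step = dPA[1] - dPA[0]

lik = 2.0 * ncx2.pdf(2.0 * (dAIC_obs + dk), df=dk, nc=dk + 2.0 * dPA)
prior = expon.pdf(dPA, loc=-dk / 2.0, scale=3.0)
post = lik * prior
post /= post.sum() * step               # normalize on the grid

p_positive = post[dPA > 0].sum() * step
print(f"P(dPA > 0 | dAIC = {dAIC_obs:.1f}) ≈ {p_positive:.3f}")
```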
4 CONCLUSION
AIC began life with Akaike's [1973] theorem, which established that AIC is an unbiased estimator of predictive accuracy. Because this proof described a long-run operating characteristic of the estimator, and not the evidential meaning of the particular estimates that the estimator might provide, and also because prior and posterior probabilities were not part of the framework used by Akaike and his school (see, for example, [Sakamoto et al., 1986]), AIC came to be viewed as a frequentist construct. However, these facts about earlier defenses of AIC do not establish that AIC scores are meaningless from a Bayesian point of view. We have argued that AIC scores provide evidence concerning the predictive accuracies of models in the sense of “evidence” sanctioned by the Law of Likelihood, which is a central Bayesian principle.7 AIC scores are no more essentially tied to frequentism than thermometer readings are.

6 We are grateful to Prasanta Bandyopadhyay for suggesting this point.
7 The result reported here pertains to any pair of nested models. The case of non-nested models remains to be investigated.
APPENDIX

Suppose that we are interested in corn crop yields in a field that has been divided into numerous plots. The plots are subjected to different conditions: they have different drainage properties, different exposures to the wind, slightly different soils, and perhaps they are treated and irrigated differently as well. The question is whether these different conditions affect the height of the corn plants. Assume that the height of each plant i is represented by a random variable Xi whose value is denoted by xi. Also, to simplify the mathematics, we assume that the random variables for plants in the same plot are independent and identically distributed with a normal distribution and unit variance; Xi ~ N(µ*ⱼ₍ᵢ₎, 1), where j(i) denotes the plot number in which plant i is found. The * indicates that this is the true value of the mean, not something hypothetical. Sometimes we will just write µ*ⱼ. The various null hypotheses one might consider here are false, because they falsely assert that two or more plots have the same mean, whereas in fact all the µ*ⱼ are different (although only slightly in some cases). A typical hypothesis is false, but it's not false that it has some predictive accuracy. The model under test assigns a degree of predictive accuracy to some model about corn yields. This meta-model attributes a property to the object model. The object model is false, but the meta-model may be true.

Suppose that there are only three plots, with mean values µ*₁, µ*₂, µ*₃. An arbitrary hypothesis asserts that the mean values are µ1, µ2, µ3 for some particular triple of numbers. A model will leave these as adjustable parameters, but possibly add constraints such as µ1 = µ2. Each such constraint will reduce the number of adjustable parameters by one. For example, the model that (falsely) asserts that the three plots have the same mean yield has one adjustable parameter. It asserts, for all plants, Xi ~ N(µ, 1), for some unknown parameter µ. Let's say that there are n1 plants sampled from plot 1, n2 plants from plot 2, and n3 plants from plot 3, with a total of n = n1 + n2 + n3 plants. The log-likelihood of any particular hypothesis in the object model (picked out by a particular triple of numbers µ1, µ2, µ3) is
$$l(\mu_1, \mu_2, \mu_3) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2}\sum_{i=n_1+1}^{n_1+n_2}(x_i - \mu_2)^2 - \frac{1}{2}\sum_{i=n_1+n_2+1}^{n}(x_i - \mu_3)^2.$$
Since µ1, µ2, µ3 are constants, −2l(µ1, µ2, µ3) − n log 2π is the sum of the squares of n normal random variables of unit variance and mean µ*₁ − µ1, µ*₂ − µ2, or µ*₃ − µ3, depending on the plot in which the plant is sampled. That means that −2l(µ1, µ2, µ3) − n log 2π is a chi-square random variable with n degrees of freedom and a non-centrality parameter equal to the sum of the squares obtained by substituting the mean value for each random variable in the quadratic. That is,

$$-2\,l(\mu_1, \mu_2, \mu_3) - n\log 2\pi \sim \chi^2(n, \lambda),$$
where

$$\lambda = n_1(\mu_1^* - \mu_1)^2 + n_2(\mu_2^* - \mu_2)^2 + n_3(\mu_3^* - \mu_3)^2.$$

Since the mean of a chi-square random variable is equal to the number of degrees of freedom (n in this case) plus the non-centrality parameter, we can calculate the predictive accuracy of any hypothesis, where the predictive accuracy of such a hypothesis is, by definition, its expected log-likelihood:
$$PA(\mu_1, \mu_2, \mu_3) = -\frac{n}{2}(\log 2\pi + 1) - \frac{1}{2}\left[\, n_1(\mu_1^* - \mu_1)^2 + n_2(\mu_2^* - \mu_2)^2 + n_3(\mu_3^* - \mu_3)^2 \,\right].$$
Note that this is the predictive accuracy of a “point” hypothesis or non-composite hypothesis. Now consider the hypothesis µ1 = µ2 = µ3 = µ, which has one adjustable parameter. This is a composite hypothesis, not a point hypothesis. Since this distinction is important, we will mark it by always referring to a composite hypothesis as a model, and never referring to point hypotheses as models. This is because the equality above does not specify any particular numerical value for µ. The question at hand is: How should we define the predictive accuracy of a model? If we fit a model to the actual set of data, we get a point hypothesis, namely the maximum likelihood hypothesis that belongs to the model. Its predictive accuracy is already well defined, but we do not want to define the predictive accuracy of the model directly in terms of this number, because the actual data set may not be typical. So, we imagine that we repeatedly generate data sets of the same size as the actual data set, and define the predictive accuracy of the model as the average predictive accuracy of the maximum likelihood hypotheses generated in this random way. The predictive accuracy of a model is therefore the predictive accuracy of a “typical” maximum likelihood hypothesis.

From a mathematical point of view, things just got complicated because there is now a double expectation involved in calculating the predictive accuracy of a model. First, we take the general formula for PA(µ, µ, µ), which is itself defined as an expected value for a particular point hypothesis, µ1 = µ2 = µ3 = µ, where we are thinking of µ as a fixed number. Recall that the predictive accuracy, PA, is defined here as the expected log-likelihood of the point hypothesis relative to some newly generated data set, which we may call test data. How well the hypothesis fits this new data is naturally thought of as a measure of its predictive accuracy. The next step is to set µ equal to the maximum likelihood estimate of µ determined by the actual data set, which is denoted by µ̂. Thus, we get PA(µ̂, µ̂, µ̂). But µ̂ may not be typical, so we want to average PA(µ̂, µ̂, µ̂) over all possible values of µ̂ that we would obtain if we were to draw data from the same (unknown) probability distribution. This average is defined as the expected value of PA(µ̂, µ̂, µ̂) as determined by the value of µ̂ that would be obtained from any data set that could be used to initially fit the model. We think of this data set as a calibration data set because it is used to fix the values of adjustable parameters. The calibration data is conceptually quite different from a test data set. To define the notion of
prediction, these data sets must be kept separate, and it is therefore essential that we define the predictive accuracy of a model as a double expectation. The maximum log-likelihood of the model is found by finding the maximum likelihood estimate of the adjustable parameter µ, which we denote by µ̂, and then writing down the log-likelihood of the particular hypothesis corresponding to this value of µ. The answer is, clearly,

$$\text{maximum log-likelihood} = l(\hat{\mu}) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$

Now think of µ̂ as a quantity that varies randomly as we consider different sets of calibration data. It is well known (e.g., see [Hogg and Craig, 1978, p. 279]) that

$$\sum_{i=1}^{n}(x_i - \hat{\mu})^2 \sim \chi^2(n-1, \lambda^*)$$
has a non-central chi-square distribution with n − 1 degrees of freedom, whose non-centrality parameter is calculated by substituting the mean values for each random variable in the quadratic. The non-centrality parameter is therefore

$$\lambda^* = n_1(\mu_1^* - \mu^*)^2 + n_2(\mu_2^* - \mu^*)^2 + n_3(\mu_3^* - \mu^*)^2,$$

where
$$\mu^* \equiv \frac{n_1}{n}\mu_1^* + \frac{n_2}{n}\mu_2^* + \frac{n_3}{n}\mu_3^*.$$

Notice what went on here. We began with an object model, and considered its log-likelihood relative to the actual data. Then we introduced a meta-hypothesis about how the log-likelihood would behave statistically if it were determined by a new data set of the same size. In other words, we have treated µ̂ as well as l(µ̂) as random variables, and we have constructed a hypothesis about their statistical behavior. The meta-hypothesis and the object model are different (one may be true, while the other is false).

Now let's consider the general case in which there are n plants sampled from m plots (n > m) in numbers n1, n2, . . . , nm, where the hypothesis under consideration asserts that the means of two or more plots are equal. We can think of the plots as being partitioned into clusters; sometimes there may be one plot in a cluster, or two plots in a cluster, and so on. Obviously, there are fewer clusters than plots, except in the special case in which there is only one plot in any cluster. In that case, the model has m adjustable parameters. In the other extreme case, in which all the plots are in one cluster, there is one adjustable parameter. In general, there are k adjustable parameters, where k is the number of clusters introduced by the hypothesis; clearly 1 ≤ k ≤ m. Let's denote the maximum log-likelihood of the model with k adjustable parameters by l(µ̂1, µ̂2, . . . , µ̂k). Then
$$-2\,l(\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k) - n\log 2\pi \sim \chi^2(n-k, \lambda^*),$$
where the non-centrality parameter λ* is calculated in the same way as before. The * reminds us that this is a meta-hypothesis about how the log-likelihood of the object model behaves. According to Akaike's theorem, AIC = l(µ̂1, µ̂2, . . . , µ̂k) − k is an unbiased estimate of the predictive accuracy of the model M. That is, E*[AIC(M)] = AIC*(M), where the expectation is taken with respect to the true distribution. But we have just learned that AIC(M) ~ −½χ²(n − k, λ*) − ½n log 2π − k, where the mean value of χ²(n − k, λ*) is n − k + λ*. Therefore, PA(M) = −½(n + k + λ*) − ½n log 2π. Observe that PA(M) becomes infinitely large in magnitude as the number of data, n, tends to infinity. It is not standard in statistics to estimate a quantity that grows infinitely large as the sample size increases, and this has led to considerable confusion about the properties of AIC (see [Forster, 2000; 2001] for detailed discussions). But, since this issue is not being discussed here, we shall refer to the per datum predictive accuracy when the need arises, and reserve PA to refer to the per datum predictive accuracy times the number of data n. Therefore, our result is written

$$PA(M) = -C - \tfrac{1}{2}\lambda^* - \tfrac{1}{2}k, \quad \text{where } C = \tfrac{1}{2}(\log 2\pi + 1)\, n \simeq 1.41894\, n.$$

So defined, predictive accuracy is the expected log-likelihood for newly sampled data that are generated by the true hypothesis, since the true hypothesis has no adjustable parameters and zero non-centrality parameter (the postulated mean values are the true values).8 Thus, −C is the highest predictive accuracy that could be achieved by any point hypothesis in the model. The constant term is unimportant for our purposes because it is the same for every model. The other terms are negative because the predictive accuracy is an expected log-likelihood, and the log-likelihood is negative whenever the likelihood is less than 1. Nevertheless, it is important to understand that higher predictive accuracies (less negative) are better than lower predictive accuracies (more negative). Firstly, the higher the value of ½λ*, the lower the predictive accuracy. To interpret this fact, notice that ½λ*/n is independent of the total number of data because λ* grows proportionally to n. In our simple case, we could write
$$\lambda^*/n = \frac{n_1}{n}(\mu_1^* - \mu^*)^2 + \frac{n_2}{n}(\mu_2^* - \mu^*)^2 + \frac{n_3}{n}(\mu_3^* - \mu^*)^2,$$

so if the proportions n1/n, n2/n, and n3/n are held fixed, then λ*/n is fixed. Therefore −(C + ½λ*)/n is the per datum predictive accuracy achieved by the model as the number of data tends to infinity. Alternatively, we could say that −(C + ½λ*) is the predictive accuracy that could be obtained if the parameter estimation were free of error. It is convenient to refer to C + ½λ* as the bias of a model. Bias is bad and predictive accuracy is good, so the smaller the bias of a model, the higher the predictive accuracy that it could potentially achieve. For small data sets especially, the term ½k is important because it measures the effect of the sampling errors involved in estimating the parameters of the model. The fewer the adjustable parameters, the greater the amount of data that can be used to estimate the value of each parameter. The negative effect that estimation errors have on the predictive accuracy of the maximum likelihood hypothesis is referred to as the variance of the model. Variance is bad and predictive accuracy is good, so they have opposite signs. In summary, the predictive accuracy of a model can be defined in terms of its bias and variance, thus

$$PA = -(\text{bias} + \text{variance}),$$

where bias and variance are positive quantities. The minus sign reflects the fact that both bias and variance are bad for predictive accuracy. Bias measures the predictive accuracy of the best hypothesis in the model (the hypothesis that would be selected if there were no variance), while variance measures the loss in predictive accuracy that comes about through estimation errors. Variance is reduced in simpler models, but the bias is greater. So, maximizing predictive accuracy (if that is the goal) involves a tradeoff between bias and variance. The predictive accuracy of a model is unknown because the model bias is unknown, and that's because the non-centrality parameter of the generating distribution is unknown. But hypotheses about its value can be tested, its value can be estimated, and confidence intervals can be inferred.

It may seem that our work is done, but we also want to know how differences in AIC scores provide evidence about differences in predictive accuracy. This is achieved by showing that differences in the maximum log-likelihoods are also governed by a non-central chi-square distribution. Or at least, we can show this when the models are nested (i.e., when one model is a special case of the other). First, we need the following theorem (which is a special case of a well known theorem proved, for example, in [Hogg and Craig, 1978, p. 279]).

THEOREM 1. Let Q1 = Q2 + Q, where Q1, Q2, and Q are three random variables that are quadratic forms in n mutually stochastically independent random variables that are each normally distributed with means µ*₁, µ*₂, . . . , µ*ₙ and unit variance. Let Q1 and Q2 have chi-square distributions with degrees of freedom r1 and r2, respectively, and let Q be non-negative. Then: (a) Q2 and Q are mutually stochastically independent, and (b) Q has a chi-square distribution with r = r1 − r2 degrees of freedom.

8 Equivalently, if the non-centrality parameter is not zero, then the hypothesis is false.
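Before applying the theorem, it may help to note that the formula for PA(M), and Akaike's unbiasedness claim, can be checked by simulation. In this sketch the plot means, sample sizes, and replication count are arbitrary choices of ours; the fitted model is the one-mean model.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true = np.array([0.0, 0.4, 1.0])        # hypothetical plot means
n_per = np.array([20, 20, 20])
n, k = n_per.sum(), 1                      # one-mean model: k = 1

# Theoretical PA of the model: -(1/2)(n + k + lambda*) - (n/2) log(2*pi)
mu_star = np.dot(n_per, mu_true) / n
lam_star = np.dot(n_per, (mu_true - mu_star) ** 2)
pa = -0.5 * (n + k + lam_star) - 0.5 * n * np.log(2 * np.pi)

aics = []
for _ in range(20000):
    x = np.concatenate([rng.normal(m, 1.0, c) for m, c in zip(mu_true, n_per)])
    mu_hat = x.mean()
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu_hat) ** 2)
    aics.append(loglik - k)                # AIC in the authors' convention

print(f"mean AIC = {np.mean(aics):.2f},  theoretical PA = {pa:.2f}")
```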
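Before applying the theorem, its two claims can be sanity-checked by simulation. A sketch (ours; the central case with equal means is chosen purely for simplicity) using the two-group decomposition employed below:

    import numpy as np

    rng = np.random.default_rng(1)
    n1, n2 = 40, 60                                # hypothetical group sizes
    n, reps = n1 + n2, 20_000
    x = rng.standard_normal((reps, n))             # unit variance, equal means
    g1, g2 = x[:, :n1], x[:, n1:]
    Q2 = ((g1 - g1.mean(1, keepdims=True)) ** 2).sum(1) \
       + ((g2 - g2.mean(1, keepdims=True)) ** 2).sum(1)   # chi-square, n - 2 df
    Q1 = ((x - x.mean(1, keepdims=True)) ** 2).sum(1)     # chi-square, n - 1 df
    Q = Q1 - Q2                                           # the theorem's Q
    print(Q.mean(), Q.var())           # approx. 1 and 2: chi-square with 1 df
    print(np.corrcoef(Q2, Q)[0, 1])    # approx. 0, consistent with part (a)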
To apply the Theorem, note that

    Q2 = −2l(µ̂1, µ̂2) − n log 2π = Σ_{i=1}^{n1} (xi − µ̂1)² + Σ_{j=1}^{n2} (xj − µ̂2)²

is a chi-squared random variable with n − 2 degrees of freedom. Then note that

    Q1 = −2l(µ̂) − n log 2π = Σ_{k=1}^{n} (xk − µ̂)²

is a chi-squared random variable with n − 1 degrees of freedom. A long, but straightforward, calculation shows that

    ∆l = l(µ̂1, µ̂2) − l(µ̂) = (n1/2)(µ̂ − µ̂1)² + (n2/2)(µ̂ − µ̂2)².

From this, we prove that Q1 = Q2 + 2∆l. From the Theorem, it follows that 2∆l is a chi-square random variable with one degree of freedom, which has a non-centrality parameter, call it ∆λ, found by substituting the mean values for each variable in the quadratic. That is,

    ∆λ = n1(µ∗ − µ1∗)² + n2(µ∗ − µ2∗)², where, by definition, µ∗ = (n1/n)µ1∗ + (n2/n)µ2∗.

So, the evidential problem reduces to something simple. The statistic 2∆l = 2(l2 − l1) provides evidence about the predictive accuracy of M2 compared to M1. Recall that models with lower numbers of adjustable parameters have higher biases, which confirms that ∆λ has the same sign as ∆bias. However, in order to keep the difference in the variances positive, we need to define

    ∆variance = variance(M2) − variance(M1) = k2 − k1 = ∆k,

because the more complex model has a greater variance than the simpler model. Now define the difference in predictive accuracy as the advantage that the complex model has over the simpler model. That is, ∆PA = PA(M2) − PA(M1). Then the tradeoff between bias and variance is expressed by the formula ∆PA = ∆bias − ∆variance = (1/2)∆λ − (1/2)∆k.

Since the difference in AIC scores is straightforwardly calculated from ∆l, it too provides evidence for hypotheses about the difference in predictive accuracy. All we require is that any particular hypothesis about the value of ∆PA is associated with a particular chi-square distribution, which requires that we know the degrees of freedom and find the non-centrality parameter. The degree of freedom is ∆k, which is the difference in the number of adjustable parameters, whereas the non-centrality parameter is given by ∆λ = ∆k + 2∆PA. From this, one can make statistical inferences about differences in predictive accuracy from differences in AIC scores, for we know that 2∆l ∼ χ²(∆k, ∆λ).
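This, too, can be checked numerically. A sketch (ours; all numbers are arbitrary) for the nested pair used above, where M1 fits one common mean and M2 fits two group means, so that ∆k = 1:

    import numpy as np

    rng = np.random.default_rng(2)
    n1, n2 = 30, 70                                # hypothetical group sizes
    m1, m2 = 0.0, 0.4                              # hypothetical true group means
    n = n1 + n2
    mu_star = (n1 * m1 + n2 * m2) / n
    d_lam = n1 * (mu_star - m1) ** 2 + n2 * (mu_star - m2) ** 2   # delta-lambda
    d_k = 1
    d_PA = 0.5 * d_lam - 0.5 * d_k                 # delta-PA = (1/2)dlam - (1/2)dk

    mu = np.repeat([m1, m2], [n1, n2])
    d_aic = []
    for _ in range(50_000):
        x = mu + rng.standard_normal(n)
        dl = (n1 / 2) * (x.mean() - x[:n1].mean()) ** 2 \
           + (n2 / 2) * (x.mean() - x[n1:].mean()) ** 2   # delta-l, as derived above
        d_aic.append(dl - d_k)                            # delta-AIC = delta-l - delta-k
    print(np.mean(d_aic), d_PA)        # agree: delta-AIC is unbiased for delta-PA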
Note that, since ∆λ ≥ 0, ∆PA ≥ −∆k/2. That is, simpler models are limited in the advantage in predictive accuracy they can have over more complex models. More interestingly, the advantage per datum is constrained to become closer and closer to zero as the number of data increases. Thus, in the large sample limit, there is no difference in per datum predictive accuracy.

In summary, given any value of ∆PA, ∆l is the value of a random variable with the distribution (1/2)χ²(∆k, ∆λ). So, ∆AIC is a random variable with the distribution (1/2)χ²(∆k, ∆k + 2∆PA) − ∆k, whose expected value is ∆PA. Or, in other words, ∆AIC is a random variable with a known distribution whose expected value is ∆PA, which is the content of Akaike's theorem. But this result goes far beyond Akaike's theorem by providing the probability distribution needed to apply whatever method of statistical inference about ∆PA one may wish to deploy.

Here is a final remark about what it means for AIC to be an unbiased estimator of predictive accuracy. It means that if we fix ∆PA and resample the calibration data, we will get values of the statistic ∆AIC whose mean value is equal to ∆PA. It does not mean that for any value of ∆AIC, the mean value of ∆PA will be equal to ∆AIC. Not only would such a statement depend on the assignment of a Bayesian prior distribution over values of ∆PA, but in some cases there is no prior that could even make it true. To see this, suppose ∆PA has its lowest possible value of −(1/2)∆k. The statistic ∆AIC will be higher in value sometimes and lower in value sometimes, which means that sometimes it will be lower than the lowest possible value of ∆PA, even though statisticians will still say that ∆AIC is an unbiased estimator of ∆PA.

ACKNOWLEDGMENTS

We thank Jim Hawthorne and Prasanta Bandyopadhyay for useful suggestions.

BIBLIOGRAPHY

[Akaike, 1973] H. Akaike. Information Theory as an Extension of the Maximum Likelihood Principle. In B. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, pp. 267–281, 1973.
[Forster, 2000] M. R. Forster. Key Concepts in Model Selection: Performance and Generalizability. Journal of Mathematical Psychology 44: 205–231, 2000.
[Forster, 2001] M. R. Forster. The New Science of Simplicity. In A. Zellner, H. A. Keuzenkamp, and M. McAleer (eds.), Simplicity, Inference and Modelling. Cambridge: Cambridge University Press, pp. 83–119, 2001.
[Forster and Sober, 1994] M. R. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions. British Journal for the Philosophy of Science 45: 1–36, 1994.
[Hacking, 1965] I. Hacking. The Logic of Statistical Inference. Cambridge: Cambridge University Press, 1965.
[Howson and Urbach, 1993] C. Howson and P. Urbach. Scientific Reasoning: The Bayesian Approach. Peru, IL: Open Court, 1993.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm. Boca Raton: Chapman and Hall, 1997.
[Sakamoto et al., 1986] Y. Sakamoto, M. Ishiguro, and G. Kitagawa. Akaike Information Criterion Statistics. New York: Springer, 1986.
[Schwarz, 1978] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics 6: 461–465, 1978.
Part IV
The Likelihood Principle
THE LIKELIHOOD PRINCIPLE
Jason Grossman
1 INTRODUCTION
The likelihood principle (LP) is a normative principle of very great generality for evaluating statistical inference procedures. The LP can be proved from arguably self-evident premises^1; indeed, it can be proved to be logically equivalent to these premises, as we will see. It tells us, to give a rough first pass, that inferences from data to hypotheses should depend on how likely the actual data are under competing hypotheses, not on how likely imaginary data would have been under a single "null" hypothesis nor on any other properties of merely possible data.^2

The plan of this chapter is to motivate and prove a precise version of the LP, with a number of caveats; briefly mention some alternative versions; and reply to an interesting objection.

Let's start with an example (see Table 1). A standard significance test would reject hypothesis 1 at the 5% level, reasoning that if hypothesis 1 were true then the observation, with a probability of only 3%, would be rather unlikely. The sense in which 3% is to be considered unlikely, on this view, is of course that it is much less than 100%, which is the total probability of the top row of the table (since all possible observations are supposed to be represented, with the final cell representing a catch-all case).

The importance of the likelihood principle is that it tells us that this is not the relevant comparison. Instead, we should be comparing what the various hypotheses say about what actually occurred. In this case, hypothesis 1 says that what occurred was unlikely (probability of 3%), but the other hypotheses say that it was even more unlikely (total probability of 0.2%).

^1 The weak sufficiency principle and the weak conditionality principle, defined below.
^2 A more precise statement of the LP is as follows. (I unpack this statement later in this chapter.) When analysing the result of an observation, provided that the choice of observations is not informative about our hypotheses, and provided that the loss (equivalently, utility) function (if we have one) is independent of the observation, inference procedures which make inferences about simple hypotheses should not be justified by appealing to probabilities assigned to observations which have not occurred, except for the trivial constraint that those probabilities place on the probability of the actual observation under the rule that the probabilities of exclusive events cannot add up to more than 1. Therefore the only component of p(x|h) which should be used to justify such inferences is the likelihood function p(xobs|h), where x ranges over the sample space, h ranges over the hypothesis space and xobs is the actual observation.
                          actual       merely possible  merely possible  other possible
                          observation  observation 1    observation 2    observations
    hypothesis 1          0.03         0.2              0.5              0.27
    hypothesis 2          0.001        0.01             0.95             0.039
    catch-all hypothesis  0.001        0.001            0.001            0.997

Table 1.

The LP does not tell us how to compare these options — for example, whether to use other knowledge we might have about the hypotheses — and it is possible that a statistical method compatible with the LP will give the same answer as the standard significance test (reject hypothesis 1). But it cannot do so by using any of Table 1 except for the first column. And in general it is very common for methods compatible with the LP (most notably Bayesianism) to give different results from methods incompatible with the LP.

A very informal argument in favour of the likelihood principle is that if the observation is much more likely on hypothesis 1 than it is on all the other hypotheses put together, as it is in this example, then that is prima facie evidence in favour of hypothesis 1. But on a significance test the observation counts against hypothesis 1. Something seems to be wrong here, and the LP puts its finger on exactly what is wrong: that the significance test takes into account observations which would have been very relevant to hypothesis 1 had they occurred, but which did not occur.

All of the methods incompatible with the LP which are in common use are frequentist methods, which is to say that they all analyse tables such as the example table by considering whole rows. (It does not follow that all frequentist methods are incompatible with the LP. More on this below.) As we have seen, a significance test uses the single row representing the null hypothesis (see Table 2). Other frequentist methods may use more than one row, but they always use whole rows. In contrast, a method compatible with the LP always justifies its inferences by appeals to the single column representing the data actually observed (see Table 3).

Before I state the mathematical apparatus which I need in order to present the LP properly, I would like to discuss informally the very basic contrast which the LP is meant to formalise.
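To make the row/column contrast concrete, here is a small sketch (ours) that performs both calculations on the numbers in Table 1:

    # Rows of Table 1: p(observation | hypothesis).
    table = {
        "hypothesis 1": [0.03, 0.2, 0.5, 0.27],
        "hypothesis 2": [0.001, 0.01, 0.95, 0.039],
        "catch-all":    [0.001, 0.001, 0.001, 0.997],
    }
    # Significance-test reasoning: the actual observation's probability is
    # judged against the whole top row (which sums to 1).
    row = table["hypothesis 1"]
    print(row[0], sum(row))                       # 0.03 out of 1.0
    # LP reasoning: compare only the first column, across hypotheses.
    col = {h: ps[0] for h, ps in table.items()}
    print(col)
    print(col["hypothesis 1"] / (col["hypothesis 2"] + col["catch-all"]))  # 15.0

On the row reading, 3% looks damning; on the column reading, hypothesis 1 makes the actual data fifteen times as likely as all the rival hypotheses combined.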
                    actual       possible       possible       ...
                    observation  observation 1  observation 2
    hypothesis H0   p0,a         p0,1           p0,2           ...
    hypothesis H1   p1,a         p1,1           p1,2           ...
    hypothesis H2   p2,a         p2,1           p2,2           ...
    ...             ...          ...            ...            ...

Table 2. Data analysed by a frequentist method will use whole rows in some way
                    actual       possible       possible       ...
                    observation  observation 1  observation 2
    hypothesis H0   p0,a         p0,1           p0,2           ...
    hypothesis H1   p1,a         p1,1           p1,2           ...
    hypothesis H2   p2,a         p2,1           p2,2           ...
    ...             ...          ...            ...            ...

Table 3. Analysis compatible with the LP
In the next two sections, I will consider two different things which we might want to do when we evaluate a statistical inference procedure: we might want to count the number of times (in different situations) it is right, on the assumption that some hypothesis or other is true (frequentism); or we might want to compare what it says about various hypotheses in the same situation (as the LP advises).
1.1 One option: frequentism
One way to evaluate a statistical inference procedure is to see how often it gets the right answer. But what does “how often” mean? It might mean that we should work out the number of times we should expect a given inference procedure to get the right answer, in some set of hypothetical test cases. If we do this in the same way we would for a deductive inference procedure, we will start with some known true premises and see how often the inference procedure infers true (and relevant) conclusions from them. Now, before we can embark on such an evaluation, we have to decide what types of conclusions we want the statistical inference procedure to infer. Perhaps, if it is going to be a useful procedure, we want it to infer some general scientific hypotheses. We might then evaluate it by asking how often it correctly infers the truth of those hypotheses, given as premises some other general hypotheses and some randomly varying observational data. We can imagine feeding into the inference procedure random subsets of all the possible pieces of observational data, and we can calculate the proportion of those subsets on which it gets the right answer. This method is the frequentist or “error-rate” method.
1.2 Another option: the LP
It might seem as though the frequentist method were the only way of finding something loosely analogous to the logician’s method for testing deductive inferences. In order to see whether it is the only way, consider what information is available to us when we are getting ready to use a statistical inference procedure in real life. Some of our premises at that time will be general statements about the way the world is, of the nature of scientific hypotheses. The rest of our premises will be statements about specific observed phenomena. The distinction between these two — fuzzy though it inevitably is — is fundamental to the nature of statistical inference. The most common epistemological goal of science is to make inferences from the latter to the former, from observations to hypotheses. (Not that this is the only goal of science.) And in order for this to be statistical inference, the hypotheses must be probabilistic (and not deductively entailed by the premises). In short, when we need a statistical inference procedure it is because we have collected some data and we want to infer something from the data about some hypotheses. What we want to know in such a situation is how often our candidate statistical inference procedure will allow us to infer truths, and we want to calculate this by comparing its performance to the performance of other possible procedures in the same situation, with the same data as part of our premises. This is what we do when we follow the LP.
If frequentist methods were always compatible with the LP then we would have a large constituency of people who agreed on how to evaluate statistical inference procedures. Perhaps they would be right, and if so we could pack up and go home. In fact, it is possible for the two options to agree numerically in some specific cases. But I hope I have made it clear that they disagree about what the norms of inference are. The frequentist method is to evaluate the performance of an inference procedure only on (functions of subsets of) all the possible pieces of observational data, while the LP tells us to evaluate its performance only on the data actually observed. So the very concept of "the performance of an inference procedure" is a completely different animal according to these two options.

2 CALCULATION, JUSTIFICATION AND CLASSIFICATION
To repeat, the likelihood principle and frequentism are two doctrines about the justification of inference procedures. The LP may appear to rule out frequentist inference procedures by denying that they are justified. But there is a back door by which some of them can creep back in. Given a specific frequentist inference procedure, we can ask: can this inference procedure, which hitherto has been justified by frequentist means, also be justified by an argument compatible with the LP? It sometimes transpires that specific brands of frequentism and specific brands of LP-compatible inference, in specific contexts, give the same answers as each other. For example, many frequentist estimation procedures are compatible with many Objective Bayesian analyses (see the chapters on Objective Bayesianism in this volume and [Bayarri and Berger, 2004, p.63]). (All Bayesian methods obey the LP.^3) Unfortunately, since frequentism and LP inference are both broad churches, there is little more that can be said concisely about the extent to which they agree.

Although the LP does not tell us how to do statistical inference, and although it does not completely rule out frequentist inference, it does rule out many specific inferences. Also, it allows us to categorise methods of statistical inference in a very natural and powerful way: a way which is more abstract and more general than the usual ways of classifying statistical theories. In particular, it divides theories of statistical inference into just two interesting, exhaustive and mutually exclusive categories. Note that the traditional polarisation of statistical theories into the Bayesian and the frequentist does not do this, since there are methods which are neither Bayesian nor frequentist, such as maximum likelihood estimation [Fisher, 1930a, p.531] and pivotal inference [Barnard, 1985].
^3 All strictly Bayesian methods obey the LP [Jeffreys, 1961, p.57], [Lindley, 1965, p.59], [Good, 1983, pp.35–36, p.132], although a few methods with "Bayes" in their names contravene both Bayes's Theorem and the LP (to some extent at least) by counting the same data twice. Examples of this are Empirical Bayes methods [Robbins, 1955; Lindley, 1983] and Objective Bayesian methods with priors chosen as a function of the likelihood function [Berger, 2006, pp.399–400].
[Figure 1. All possible theories of statistical inference]

This diagram includes future, as yet unthought of, theories of statistical inference, as well as existing ones, assuming that future theories have roughly the same form as existing theories (i.e., that they are based on probabilistic equations formulated in terms of data and hypotheses).

The LP also captures some of the most attractive features of Bayesianism, while leaving open the question of whether a subjective prior should be used (and leaving open the question of exactly what "subjective" means). Since it provides a lot of common ground between factions of Bayesians, the LP is a good, irenic starting point for agreement between them, and possibly between factions of philosophers of statistics in general.
3 TERMINOLOGY AND CAVEATS
In this section I define the terms I will need to state the LP properly, and state some important caveats. First of all, what do I mean by “inference procedures”? I will assume that we all know what a procedure (simpliciter) is. There are perhaps issues about ad hoc or ill-defined procedures, and about the difference (if any) between a deliberate experiment and a mere (perhaps accidental) observation, but I will leave those issues to one side for lack of space. The remaining question is what counts as an inference. Because the LP does not take into account a utility or loss function (see discussion of this below), the LP does not give us a decision theory. Because it does not take into account prior probabilities, it does not even give us a full theory of statistical inference. The LP is about the evidence afforded to hypotheses by data,
and only indirectly about our epistemic actions based on that evidence. [Royall, 2004] makes an important distinction between the questions:

    What should I believe?
    What should I do?
    How should I interpret this body of observations as evidence? [Royall, 2004, p.122]

The LP answers only the third question. It is about evidential inferences, or about how inferences should be made insofar as they rely on the evidence provided by data. Therefore in this chapter I use "inferences" in a narrow sense, to refer to any beliefs and partial (probabilistic) beliefs which are held or followed, and any actions which are taken, as a direct result of the evidence supplied by an observation.
3.1 Notation
As is standard, I use X and Y to denote random variables^4 and p(F(x) | G(y)) as shorthand for p(F(X) = F(x) | G(Y) = G(y)).

H is the set of hypotheses under active consideration by anyone involved in the process of inference.^5 Θ is a set (typically but not necessarily an ordered set) which indexes the set of hypotheses under consideration. I will always treat θ as an index on the whole set of hypotheses. In other words, (∀h ∈ H)(∃θ ∈ Θ : Hθ = h).^6

X is a space of possible observations. xobs is an actual observation. It is either the result of a single experiment, or the totality of results from a set of experiments which we wish to analyse together. When xobs is the only observation being used to make inferences about a hypothesis space H, I will refer to xobs as the actual observation. I assume that it represents all observations considered relevant to any of the hypotheses in some set H.

^4 Random variable is standard terminology in discussions of statistics, but it is slightly misleading. Fortunately, I will be able to do without discussing random variables most of the time; but not quite all the time. A random variable such as X is (famously) neither random nor a variable: it is a function which associates a real number (but not generally a unique one) with each possible observation. Typically, it is considered to be subject to the constraint that for all x ∈ R the set {y : X(y) ≤ x} is measurable according to a standard measure on R. Although X, a random variable, is not a variable, x, a possible value of X, is a variable. I write the set of possible values of x — in other words, the range of the random variable X — as X.

^5 If we omitted this condition that H contain all the hypotheses under consideration, the likelihood principle could mistakenly require that we treat two observations as evidentially equivalent even though we know that one supports an important but unmentioned hypothesis more strongly than the other one does. Detailed examples illustrating this point are given in [Berger and Wolpert, 1988, pp.36–38] and elsewhere in the statistical literature.

^6 Under what general circumstances can we be sure that there is an index set on H? First of all, if H is finite or countably infinite then of course it can be indexed. (Strictly speaking it should be provably countable by a constructive proof, but this is an unimportant detail.) Interestingly, if there is a countable number of observables, each with a countable number of possible states, H can be any subset of the set of all probability distributions over the possible observations, which is countable and thus can be indexed. Alternatively, if we can fully describe an uncountable but continuous distribution (either in natural language or in mathematics) then we can still count it as being indexed by parameters, the parameters in this case being whatever lexical tokens are used to describe the function (possibly a countably infinite number of them, if the definition contains terms like (∀i ∈ Z)). So H can be indexed in the discrete case and in all describable continuous cases. In most systems of pure mathematics there are, provably, indescribable functions; but as philosophers of applied mathematics we need not worry about them too much. Even so-called "non-parametric statistics" can be indexed in practically all cases. Usually this is trivial, as non-parametric models are usually discrete and so have a countable number of hypotheses which are easy to order.

3.2 Four caveats

Caveat: I only discuss simple hypotheses.

Simple hypotheses are ones which give probabilities to potential observations. The contrast here is with complex hypotheses, also known as models, which are sets of simple hypotheses such that knowing that some member of the set is true (but not which) is insufficient to specify probabilities of data points. See for example [Forster, 2006] for an extended discussion of the problems introduced by complex hypotheses.

Caveat: I only discuss inference from data to hypotheses.

I will not be concerned with experimental design (in which hypotheses but not data are typically known), nor with hypothesis generation (if there is such a thing).

Caveat: I ignore the relative desirability of finding one hypothesis or another to be true.

To rephrase in statistical terminology: I will be ignoring utility functions, or (equivalently) loss functions. (A loss function is just a utility function multiplied by −1.) Since my focus is on the scientific uses of inference, this may seem like a reasonable assumption. Sadly, it is not clear to me that it is. It seems to me that the best way to do inference is often to weight conclusions about hypotheses — each of which is a possible error — according to the desirability of avoiding the amount of error represented by each hypothesis. On the other hand, it is only sometimes possible to do this, since in general there may be agreement in a community as to non-normative facts, including probabilistic facts, but not as to the desirability of hypotheses. Moreover, even when there is agreement as to the desirability of hypotheses among the people directly concerned with a statistical inference, they are likely to need to justify their results to a wider public which does not share their values. So in many cases my caveat will be an appropriate simplifying assumption, even if not in all cases.

Caveat: When I need to consider epistemic decision makers at all, I assume there is only one of them.
See [Kadane et al., 1999] for a number of deep arguments about the complications which multiple decision makers introduce into Bayesian theory. I do not consider such complications here.

4 THE LIKELIHOOD PRINCIPLE: PRECISE STATEMENT
We want the likelihood principle to say something like the following:

    We should analyse Table 1, and any similar table, using only the numbers in the single column corresponding to the data which were actually observed.

Or, more mathematically:

    The only permissible contribution of a space of possible observations X to inferences about a set of simple hypotheses {hi} is via the probabilities those hypotheses give to the actual observation, namely p(xobs|hi).

These probabilities, p(xobs|hi), are known as the likelihood function of xobs. The likelihood function is a function from hypotheses to probabilities. The likelihood function of an observation need not sum to 1. Moreover, it is convenient to allow the identity conditions of likelihood functions to be different from those of functions in general. Two likelihood functions are considered the same iff they are proportional: i.e., iff L1(h) = c × L2(h) for some c > 0.^7

However, the LP is only true under the following conditions of applicability:

1. We cannot infer anything about the relative importance of the various possible inferential errors from the observation (i.e., the loss function, or equivalently the utility function, is either independent of the observation or unimportant). ([Good, 1981], reprinted in Good 1983, p.132)

2. The choice of observations is not informative about the hypotheses, although of course the outcome of the observations hopefully will be informative about the hypotheses [Hill, 1987; Berger and Wolpert, 1988]. The distinction here is analogous to the distinction in quantum theory between the setting on a measurement apparatus and the outcome of a measurement.

^7 A further condition on equality of likelihood functions is that L1(h) and L2(h) are only considered equal if their variables have the same physical interpretations. This is because we use the meanings of likelihood functions, not merely their shapes. This is important in the literature on the LP in general (although not in this chapter) because some proofs of the LP use relabellings of variables, which are valid only as long as this condition is observed. The first (sketch) proof of the LP [Pratt, 1961, p.166] was of this form; others, such as [Birnbaum, 1962, pp.277–278], arguably break the condition; see [Pratt, 1962, pp.315–316].
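The proportionality clause matters because distinct experiments can generate proportional likelihood functions. A standard textbook illustration (not Grossman's own example; the numbers are ours): nine heads in twelve coin tosses, whether the number of tosses was fixed in advance (binomial sampling) or tossing continued until the third tail (negative binomial sampling):

    from math import comb

    n, heads = 12, 9
    def binom_lik(h):      # number of tosses fixed in advance
        return comb(n, heads) * h ** heads * (1 - h) ** (n - heads)
    def negbin_lik(h):     # tossing continued until the third tail
        return comb(n - 1, heads) * h ** heads * (1 - h) ** (n - heads)
    for h in (0.3, 0.5, 0.7, 0.9):
        print(h, binom_lik(h) / negbin_lik(h))    # constant ratio 4.0 for every h

Since the two likelihood functions are proportional, the LP counts the two data sets as evidentially equivalent, even though their tail-area calculations (and hence classical significance tests) differ.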
It is easy to construct cases in which the LP fails if these conditions are broken. [Gelman et al., 1995] provide several particularly nice illustrations of the importance of the second condition, which they call ignorability of the method of data collection. For example:

    Suppose for instance that we, the authors, give you, the reader, a collection of the outcomes of ten rolls of a die and all are 6's. Certainly your attitude toward the nature of the die would be different if we told you (i) these were the only rolls we performed, versus (ii) we rolled the die 60 times but decided to report only the 6's, versus (iii) we decided in advance that we were going to report honestly that ten 6's appeared but would conceal how many rolls it took, and we had to wait 500 rolls to attain that result. [Gelman et al., 1995, p.192]

This is a case in which the choice of observations is informative about the hypotheses, and so the likelihood principle does not apply. The experimenter should report the sampling scheme in such cases, so one option is to dismiss such cases as inaccurate reporting of data.^8 However, while the experimenter may be misreporting the experiment, they are (arguably) reporting the observation accurately, so they might (arguably) be covered by the LP. For the sake of clarity it is important to exclude such cases explicitly. Fortunately, the likelihood principle makes sense and is useful even when such cases are excluded.

^8 Thanks to an anonymous referee for this point.
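The difference between the three reporting schemes can be made quantitative. A rough sketch (ours; each scheme's probability model is a stylised rendering of the quotation) computing the probability of the report "ten 6s" for a fair die and for a die heavily loaded towards 6:

    from math import comb

    def scheme_i(p):       # ten rolls, all reported: all ten are 6s
        return p ** 10
    def scheme_ii(p):      # 60 rolls, only the 6s reported: exactly ten 6s
        return comb(60, 10) * p ** 10 * (1 - p) ** 50
    def scheme_iii(p, n=500):   # roll until ten 6s appear (within 500 rolls)
        return 1 - sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(10))

    for p in (1 / 6, 0.9):
        print(p, [f"{s(p):.3g}" for s in (scheme_i, scheme_ii, scheme_iii)])

The same report is roughly 2 × 10^7 times likelier under the loaded die than the fair one on scheme (i), overwhelmingly unlikelier on scheme (ii), and nearly uninformative on scheme (iii); this is why the LP's second condition excludes non-ignorable data collection.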
In addition, the statement of the LP itself needs to be refined a little. It is difficult to make precise the notion that the analysis should proceed via the likelihood function, or should depend only on the likelihood function. The only way to be really clear about this notion, it seems, is to say what the analysis should not take into account. Looking back at tables 1 to 3, this makes sense: saying that the analysis should not take into account anything outside column 1 is a precise way of saying that the analysis should only use column 1, without having to say anything about how the analysis should use column 1. These considerations lead to the definition which I foreshadowed earlier:

The likelihood principle (LP): Under the conditions of applicability mentioned above, inference procedures which make inferences about simple hypotheses should not be justified by appealing to probabilities assigned to observations which have not occurred, except for the trivial constraint that those probabilities place on the probability of the actual observation under the rule that the probabilities of exclusive events cannot add up to more than 1. Therefore the only component of p(x|h) which should be used to justify such inferences is the likelihood function p(xobs|h), where x ranges over the sample space, h ranges over the hypothesis space and xobs is the actual observation.
It is worth remembering two points here.

(i) Strictly speaking, the second sentence is superfluous; what is doing the work is the emphasised clause. The point of the second sentence is to remind us why (historically) the LP is called the likelihood principle.

(ii) The LP is a constraint on statistical inference; that is, on the analysis of observations. The LP does not mention experimental design, and it does not apply to other statistical problems which fall outside the category I am calling statistical inference, such as calculating long-term averages over repeated actual experiments (so-called "quality control") [Hacking, 1965].
5 A PROOF OF THE LIKELIHOOD PRINCIPLE
The proof given here is largely based on [Berger and Wolpert, 1988], which in turn is largely based on the seminal proof of [Birnbaum, 1962].
5.1 Premise: The weak conditionality principle (WCP)
The following well-known problem adapted from [Cox, 1958] introduces the main premise of the proof. Suppose that we are doing an experiment to test a hypothesis h0 and that we decide to go along with the frequentist idea that we should imagine repetitions of the experiment and make sure that at most 5% of them give the wrong answer, on the assumption that h0 is true. An almost realistic thought experiment which sheds light on our options involves sending blood to one of two pathology laboratories according to which of them sends the next pick-up courier, or according to the toss of a coin. One laboratory is known to send back an estimated haemoglobin count with a large amount of random error; the other lab always sends back a count that's almost exactly correct. To achieve an overall 5% error rate as defined above we need to take into account both error rates. So if the blood actually went to the accurate laboratory, we need to adjust the error rate on the grounds that it could have gone to the inaccurate one.

This is unsatisfactory, and of course it is not what any practising statistician would do. What she would do is take into account only the characteristics of the laboratory the blood actually went to. In this particular case at least, one should treat the coin toss and the laboratory measurement as two separate experiments, regardless of whether they were planned together. Taking the result of the coin toss into account is called conditioning on the data. Cox himself, in his [Cox, 1958], came to the conclusion that we should perform a conditional calculation (i.e., take into account only the characteristics of the laboratory the blood actually went to) in this particular case. He did not have available to him the proof of the LP, which shows (on very mild assumptions) that this case generalises to practically all statistical inferences.
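The point can be simulated. A minimal sketch (ours; the standard deviations are invented for illustration) comparing the unconditional "5% overall" procedure with the conditional one:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 200_000
    sd = np.where(rng.random(N) < 0.5, 1.0, 20.0)   # coin toss chooses the lab
    err = rng.standard_normal(N) * sd               # that lab's measurement error
    c = np.quantile(np.abs(err), 0.95)              # unconditional critical value
    print(c)                                        # approx. 33
    for s in (1.0, 20.0):
        m = sd == s
        print(s,
              np.mean(np.abs(err[m]) <= c),          # unconditional coverage, given the lab
              np.mean(np.abs(err[m]) <= 1.96 * s))   # conditional coverage: 95% either way

The unconditional procedure achieves its 5% error rate only on average: given the accurate laboratory it is absurdly conservative (an interval of roughly ±33 when the error standard deviation is 1), and given the noisy laboratory it under-covers. Conditioning on the lab actually used gives 95% in both cases.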
Cox — like all non-Bayesian statisticians at the time — held that one should condition only under special circumstances, but was unable to work out exactly what those circumstances were. His example was therefore seen as a paradox. More recently, some frequentists have developed rather sophisticated conditional frequentist theories for conditioning on parts of the data (see, for example, [Casella and Berger, 1987; Berger et al., 1994]). However, we will see that the necessity of conditioning in Cox's example, combined with a very plausible axiom of sufficiency, is enough to prove rigorously that one should follow the likelihood principle, which in turn entails that one should condition on all the available data.

The LP is a normative principle, and the idea of proving a normative principle may seem strange in the light of Hume's insistence that we can't derive an ought from an is. What makes it possible is that the conditioning premise drawn from the Cox example is also normative (although very, very weak): it says that we must take into account the properties of one laboratory and must ignore the properties of the other. The other premise which we will need for the proof, the weak sufficiency principle, is also normative. It is possible to prove a normative principle from other normative principles without violating Hume's rule. The chain of reasoning from these premises to the likelihood principle is purely mathematical.

By proving the likelihood principle from a premise about conditioning which mentions only a single coin toss, we can see that if we are squeamish about conditioning on observed data in some cases then we must bite the bullet and disown conditioning in all cases, even in Cox's case; because if we condition in Cox's case then we have the likelihood principle, and that in turn entails that conditioning is always necessary (assuming the other premise of the proof, and the conditions of applicability which I've spelled out above).

The Cox example hardly needs to be generalised at all to form the first premise we need. The laboratories will be generalised to arbitrary experiments. The coin toss can stay as it is.

Informal statement: If one of two possible experimental measurements is chosen by the toss of a fair and indeterministic coin, no inference procedure should require information about the measurement that was not performed.

Formal statement (the weak conditionality principle): Consider two experimental measurements M1 = (X1, H, p1) and M2 = (X2, H, p2). (By this I mean that M1 has sample space X1, hypothesis space H and probability or probability density function p1, and similarly for M2.) Note that the set of hypotheses is the same for each. This is a deliberate restriction which entails that this principle does not apply to hypotheses about alchemy compared with hypotheses about chemistry, although it does apply to comparing statistical models that each encompass both alchemy and chemistry. This condition is useful for the
proof, and it is guaranteed to be satisfied if we stick to the condition (stated earlier) that H contain all the hypotheses under consideration.

Now consider an observation from a new experiment, M∗, which consists of using a fair, epistemically indeterministic coin to select one of M1 and M2 at random with probability 1/2 each. Formally, M∗ = ((J, XJ), H, pJ(J, XJ)). By "epistemically indeterministic" I mean simply that no deterministic pattern in the behaviour of the coin has been noted or is expected. Some people are known to be able to toss a coin so as to yield a pre-determined outcome. Our coin-tosser must not be one of those people.

Suppose M∗ is performed, and turns out to consist of M1. Then any inference procedure should derive the same inferences about H from this instance of M∗ as it would have derived from M1 alone.

The weak conditionality principle is called "weak" because logically stronger conditionality principles are sometimes used for similar purposes [Evans et al., 1986, p.185].
5.2 Premise: The weak sufficiency principle (WSP)
The simplest definition of sufficiency is as follows:

Sufficiency definition 1: T(x) is a sufficient statistic for h iff p(x|T(x)) is independent of h.

A sufficient statistic for h, T(x), typically contains much less information about the world than x does, but the same amount of information (in a sense which I will make precise) about h. For example, if x is a vector of the heights of a sample of people then, under the Normal or log-Normal models most often used for human heights, the average (mean) height of the sample, T(x) = (x1 + ... + xn)/n, is a sufficient statistic for the average (mean) height of the population.

The reason for the name "sufficient" is that if T(x) is sufficient for h (in the technical sense above) then it is all we need to know about x, if our sole purpose is to infer things about h, and so it is sufficient information in the lay sense. Anything else we know about x, over and above T(x), is epistemically redundant. For example, if we're sure that all we care about is the average height of a population (a big if), there is no point in recording more than the average height of the test sample; any other information about the test sample can be thrown away.

An equivalent definition of sufficiency is:

Sufficiency definition 2: T(x) is a sufficient statistic for h iff we can find a function T′ which allows p to be factorised in the following way:
    (∀h ∈ H)   p(x|h) = T′(T(x), h) × p(x|T(x)).
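Both definitions can be verified directly in a toy case. A sketch of ours (not from the chapter) for five coin tosses with bias h, where the number of heads is the sufficient statistic:

    from math import comb

    n, x = 5, (1, 1, 0, 1, 0)          # an arbitrary observed sequence
    k = sum(x)                          # T(x): the number of heads
    for h in (0.2, 0.5, 0.8):
        p_x = h ** k * (1 - h) ** (n - k)                  # p(x | h)
        p_T = comb(n, k) * h ** k * (1 - h) ** (n - k)     # p(T(x) = k | h)
        print(h, p_x / p_T)             # definition 1: constant (= 1/comb(5, 3)) in h
        # definition 2, with T'(k, h) = p(T(x) = k | h):
        assert abs(p_x - p_T * (1 / comb(n, k))) < 1e-12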
The factorization theorem shows that this definition is equivalent to the earlier definition.^9 Seeing that statistical sufficiency is epistemically satisfactory is even easier using definition 2 than it was using definition 1. For definition 2 shows that all functions of p(x|h) can be calculated from T(x) and h, when T is sufficient for h. This point may look superficially as though it assumes the likelihood principle, but it does not. That all inferences about h depend on p(xobs|h) is, more or less, the likelihood principle; but that all such inferences depend on p(xi|h) for some set {xi} is an unrelated, trivial claim, and that is what I am relying on here.

For example, in most experiments on coin tossing, the number of heads and the number of tails are jointly sufficient for all inferences; the order in which we observe the heads and tails is irrelevant. (x1, x2, ... are jointly sufficient iff the ordered tuple ⟨x1, x2, ...⟩ is sufficient.) All statistical models have sufficient statistics, trivially, since the identity function on X is a sufficient statistic for h according to the above definitions. Of course, such trivial sufficient statistics are not very useful. In addition, a model may have many sufficient statistics.

The weak sufficiency principle: If T(x) is a sufficient statistic for h, and if T(x1) = T(x2), then inference procedures should not derive different inferences about H from x1 and x2. (adapted from Basu [1975, p.9])

The weak sufficiency principle was named thus by Dawid [1977] because it is weaker (claims less) than other similar principles. The WSP is essentially the LP limited to a single experiment. As Leslie [2008] puts it, the difference between the WSP and the LP is like the difference between a State law and a Federal law in a Federal country like Australia or Switzerland (see also [Welsh, 1996, p.79]). To make this point obvious, suppose that an experiment has no obvious sufficient statistic for a parameter θ except for the trivial one consisting of the whole data ... but suppose further that two data points x1 and x2 have the same probabilities under all hypotheses. Then the LP says that we should treat x1 and x2 equally in inferences about θ. But it also follows that θ does have a non-trivial sufficient statistic after all, namely the (rather ungainly) statistic consisting of the whole data with all occurrences of x1 replaced by x2. Applied to this sufficient statistic, the WSP also says that we should treat x1 and x2 equally in inferences about θ.

^9 One part of the factorization theorem is easy to prove. It is easy to see that if this equation holds then T(x) is sufficient for h on definition 1, thus: if we know T(x) then we know the right-hand side as a function of h (bearing in mind that we can calculate p(x|T(x)), because it does not depend on h); hence, we know the left-hand side, which establishes that T(x) is sufficient for h. The converse is more long-winded to prove, and I will not be relying on it (since my working definition of sufficiency will be the second version, and all I need show is that whatever fits my definition also fits the other one, not vice versa), so I will omit the proof.
This is hardly a rigorous argument, but I hope it makes the equivalence of the WSP and the LP in a single experiment clear.^10

I do not think that any statistician ever deliberately breaks the WSP. If a conflict with the WSP ever arises, the only reasonable conclusion is that T(x) is not a sufficient statistic for h after all. I would like to give three arguments for this. I do not claim that the three arguments are independent of each other; only that one might convince where the others fail.

Firstly, the WSP follows directly from the claim (defended above) that statistical sufficiency entails epistemic sufficiency.

A second argument for the WSP, adapted from [Basu, 1975, p.9], is as follows. Let us imagine that we have observed xobs in a two-step procedure: we have first conducted an experiment with sample space X but noted only the value of T(x), not the precise value of x. Then we conduct a further, separate experiment with sample space T(x), noting this time the exact value xobs which we obtain. Since T is sufficient for h, the second experiment is "statistically trivial" (Basu's term) and tells us nothing about h. Hence, the outcome of the second experiment can make no difference to our inferences about h. Hence values x1 and x2 which are possible outcomes of the second experiment (i.e., such that T(x1) = T(x2)) should lead to the same inferences about h.

Thirdly, here is a Bayesian argument for the WSP. It can be proved that if T(x) is sufficient for h, as defined above, then (∀x) p(h|T(x)) = p(h|x). So knowing T(x) allows us to know the entire function p(h|x) (as a function of h).
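The Bayesian argument is easy to verify numerically. A sketch (ours) with an arbitrary three-point prior over the bias of a coin, checking that p(h|x) = p(h|T(x)) for the sequence used earlier:

    from math import comb

    n, x = 5, (1, 1, 0, 1, 0)
    k = sum(x)
    prior = {0.2: 0.3, 0.5: 0.4, 0.8: 0.3}         # hypothetical prior over h

    def normalise(d):
        z = sum(d.values())
        return {h: v / z for h, v in d.items()}

    post_x = normalise({h: pr * h ** k * (1 - h) ** (n - k)
                        for h, pr in prior.items()})              # p(h | x)
    post_T = normalise({h: pr * comb(n, k) * h ** k * (1 - h) ** (n - k)
                        for h, pr in prior.items()})              # p(h | T(x) = k)
    print(all(abs(post_x[h] - post_T[h]) < 1e-12 for h in prior))  # True

The binomial coefficient cancels in the normalisation, which is exactly why conditioning on the sufficient statistic changes nothing.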
6 A PROOF OF THE LIKELIHOOD PRINCIPLE: CONTINUATION FROM THE WSP AND THE WCP
Consider two statistical measurements M1 = (X1, H, p1) and M2 = (X2, H, p2). Next, consider the mixed experiment M∗ (which was defined in the statement of the weak conditionality principle as follows: a fair, epistemically indeterministic coin is tossed; according to its outcome, one of the experiments M1 and M2 is performed). Now suppose that whichever experiment hasn't been performed yet is also performed. At this stage we have an outcome x1 from M1, an outcome x2 from M2, an outcome j indicating which experiment was performed first (j = 1 for M1 and j = 2 for M2), and an outcome from M∗. The outcome from M∗ is J = 1 or 2 and x∗ = x1 or x2. The possible outcomes are denoted (j, xj).

^10 I give a rigorous proof of the equivalence of the WSP and the LP in the presence of a further axiom (the WCP) later in this chapter, and it will be obvious that the WCP is trivially true in the context of a single experiment, so that proof will establish formally that the WSP is equivalent to the LP in the context of a single experiment.
Then let t0 be the arbitrarily chosen data point (0, 0) and consider the statistic

    T(j, xj) = t0        if (j, xj) = (1, x1) or (2, x2)
             = (j, xj)   otherwise.^11

Is T a sufficient statistic for h? Generally, no. Recall that T is a sufficient statistic for h iff p can be factorised as

    (∀h ∈ H)   p(j, xj|h) = T′(T(j, xj), h) · p(j, xj|T(j, xj)).
There need not, in general, exist a suitable T′ to match our choice of T. But suppose that the likelihoods of x1 and x2 are equal (i.e., p1(x1|h) ∝ p2(x2|h), or (∃k > 0)(∀h ∈ H)(p1(x1|h) = k · p2(x2|h))). To prove the likelihood principle, we require to show that T is now sufficient for h. Let T′ be as follows:

    T′((j, xj), h) = (1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)   if (j, xj) = t0
                   = p(j, xj|h)                              otherwise.

Then

    T′(T(j, xj), h) = (1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)   if (j, xj) = (1, x1) or (2, x2)
                    = p(j, xj|h)                              otherwise.

To calculate p(j, xj|T(j, xj)) (the final term in the sufficiency equation), note:

    p((1, x1)|T = t0, h) = p∗(J = 1|T = t0, h) · p1(X1 = x1|T = t0, h)
                         = (1/2)p1(X1 = x1|T = t0, h)
                         = (1/2)p1(X = x1|h) / [(1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)]

    p((2, x2)|T = t0, h) = (1/2)p2(X = x2|h) / [(1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)],   by symmetry

    p((j, xj)|T = (j, xj), h) = 1,   (j, xj) ≠ t0.

^11 The otherwise clause can never represent an actual outcome, since I have defined the indexing of the experiments in such a way that only (1, x1) can occur if M1 is performed first and only (2, x2) can occur if M2 is performed first. If we already knew that the likelihood principle was true, we might not be interested in non-actual outcomes. But as this is a proof of the LP, we can consider at least the mathematical properties of such outcomes.
Now we can check the sufficiency equation. If J = 1, X1 = x1 then

    T′(T(j, xj), h) · p(j, xj|T(j, xj))
        = [(1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)] × (1/2)p1(X = x1|h) / [(1/2)p1(X = x1|h) + (1/2)p2(X = x2|h)]
        = (1/2)p1(X = x1|h)
        = p(j, xj|h).

By symmetry, if J = 2, X2 = x2 then T′(T(j, xj), h) · p(j, xj|T(j, xj)) = p(j, xj|h). And for all other (J, XJ), T′(T(j, xj), h) · p(j, xj|T(j, xj)) = p(j, xj|h) × 1. This establishes that T is sufficient for h.

Given this sufficiency of T for h, and given the weak sufficiency principle applied to the fact that T(1, x1) = T(2, x2), it follows that no inference about h is valid on observation (1, x1) in the mixed experiment unless it is also valid on (2, x2).

Now recall that j is chosen by a fair, indeterministic coin toss. Consequently, the weak conditionality principle applies. It tells us that no inference about h is valid on (1, x1) unless it is also valid on x1 alone. (x1 corresponds to M1 in my formal statement of the weak conditionality principle above.) In other words, the observations (1, x1) and x1 are equivalent in terms of the inferences they license. Similarly, the observations (2, x2) and x2 are equivalent in terms of the inferences they license. And we determined in the previous paragraph that the observations (1, x1) and (2, x2) are also equivalent to each other. Hence, the observations x1 and x2 license the same inferences as each other. To summarise this paragraph: if we write "≡" for "license the same inferences as each other", we have just shown that x1 ≡ (1, x1) ≡ (2, x2) ≡ x2. The relation ≡ is transitive, so x1 ≡ x2. Consequently, no inference is valid on x1 (regardless of the value of j) unless it is also valid on x2.

We have proved this for any x1 and x2 with equal likelihoods under the models under consideration. It follows that any two observations which share likelihood functions must license the same inferences. This is enough to establish some of the versions of the LP found in the literature. To establish the version I have given, though, a little more work must be done.^12

^12 Recall that the version to be established is as follows: Under the conditions of applicability given above, inference procedures which make inferences about simple hypotheses should not be justified by appealing to probabilities assigned to observations which have not occurred, except for the trivial constraint that those probabilities place on the probability of the actual observation under the rule that the probabilities of exclusive events cannot add up to more than 1. Therefore the only component of p(x|h) which should be used to justify such inferences is the likelihood function p(xobs|h), where x ranges over the sample space, h ranges over the hypothesis space and xobs is the actual observation.
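The crucial step, that T is sufficient exactly when the two likelihoods are proportional, can also be seen numerically. A sketch (ours; the two Bernoulli-style experiments are invented for illustration), checking whether p((1, x1) | T = t0, h) depends on h:

    # M1: one coin toss, outcome heads:            p1(x1|h) = h
    # M2 (case A): two tosses, exactly one head:   p2(x2|h) = 2h(1-h)  (not proportional to h)
    # M2 (case B): two tosses, first toss a head:  p2(x2|h) = h        (proportional, k = 1)
    for name, p2 in [("case A, not proportional", lambda h: 2 * h * (1 - h)),
                     ("case B, proportional",     lambda h: h)]:
        ratios = [0.5 * h / (0.5 * h + 0.5 * p2(h)) for h in (0.2, 0.5, 0.8)]
        print(name, [round(r, 4) for r in ratios])
    # The ratio is constant in h only in the proportional case, which is
    # exactly the condition under which T was shown to be sufficient.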
Consider any experiment M = (X, H, p), let the outcome of M be labelled x, and based on M and x define a new experiment M′ = (Y, H, pY) where Y is 1 or 0 according to whether X = x or not, thus:

    (1)   Y = 1 if X = x;   Y = 0 if X ≠ x,

so that:

    (2)   p(Y = 1|h) = p(x|h) and p(Y = 0|h) = 1 − p(x|h).
So (∀h) pY(Y = 1|h) ∝ p(x|h): the observations Y = 1 and x share likelihood functions. Hence no inference is valid on the observation of x in M unless it is also valid on the observation of Y = 1 in M′. (All this is a trivial consequence of the part of the likelihood principle already proved above.) Now we should ask which inferences are valid on the observation of Y = 1 in M′. If our only observation is that Y = 1, whatever we can infer about H from X and x must be a function of the functions of X and x that appear in the description of M′. The only such functions are p(x|h) and 1 − p(x|h) (from (2), or directly from (1) if you prefer). But these are just the likelihood function of x and 1 minus the likelihood function of x. And, in particular, no mention of any part of X except x is made in the description of M′. So all inferences from M′, and hence from M, must depend functionally on x only via the likelihood function, and in particular no inferences from M may use probabilities of any part of X except the part which was actually observed. This establishes the full likelihood principle.

The premises used in this proof form a minimal set of premises for proving the likelihood principle (although not of course the only minimal set), as can be shown by proving the premises of the proof from the conclusion of the proof (the likelihood principle). I will do this by showing that the likelihood principle implies each premise separately. Since it is implied by both premises jointly, it must then be logically equivalent to their union, since if (a ∧ b) ⊢ c and c ⊢ a and c ⊢ b then c ≡ (a ∧ b).

It follows directly from the likelihood principle that the correct conclusion in Cox's example is to ignore the characteristics of the laboratory not used. To prove the weak conditionality principle given above (the formal version of that solution to Cox's paradox), we note that in experiment M∗,

    p(j, xj|h) = (1/2)pj(xj|h) ∝ pj(xj|h).
So M∗ and Mj have proportional likelihood functions, where Mj is the measurement chosen by the coin toss. Hence M∗ and Mj licence identical inferences. Hence only Mj matters.

To prove the weak sufficiency principle from the likelihood principle, note that if T is sufficient for H then (by definition) p(X|T(X)) is independent of h. If T(x1) = T(x2) (as in the premises of the weak sufficiency principle) then p(x1|T(x1), h) = p(x2|T(x2), h). Then p(x1|h) = p(x2|h) — x1 and x2 have identical likelihood functions. So, by the likelihood principle, any inference procedure should draw the same conclusions from x1 as from x2.

This completes the proof that the likelihood principle is logically equivalent to the conjunction of the weak conditionality principle and the weak sufficiency principle.
7 OTHER VERSIONS OF THE LIKELIHOOD PRINCIPLE
R. A. Fisher seems to have made the first statements that could be considered to represent the likelihood principle. For example:

    Likelihood serves all the purposes necessary for the problem of statistical estimation. ([Fisher, 1930b], reprinted in [Aldrich, 2000, p.171])

A fuller and more precise statement of the principle, and the first version to be picked up and developed by other authors, was by George Barnard in 1947:

    The connection between a simple statistical hypothesis H and observed results R is entirely given by the likelihood, or probability function L(R|H). If we make a comparison between two hypotheses, H and H′, on the basis of observed results R, this can be done only by comparing the chances of getting R, if H were true, with those of getting R, if H′ were true. [Barnard, 1947, p.659]

The most influential discussions of the LP have been by Allan Birnbaum and (jointly) by James O. Berger and Robert Wolpert. [Birnbaum, 1962] provided the first full proof of the LP in 1962 (essentially the same proof as I give in this chapter), while [Berger and Wolpert, 1984] and [Berger and Wolpert, 1988] gave a more general proof and a much more complete discussion, also answering many objections both to the LP and to the proof. [Berger and Wolpert, 1988] is still by far the most complete reference on the LP to date.

Both [Birnbaum, 1962] and [Berger and Wolpert, 1988] state the LP in terms of the evidence or information which the data give us about our hypotheses. To do this they both introduce an "evidence function" Ev, which they define in interestingly different ways. In [Birnbaum, 1962], Ev is defined primarily in
terms of the premises used in his proof of the principle.^13 In [Berger and Wolpert, 1988], Ev is left deliberately vague in some places, and in others replaced with "information", as in the following definition of the LP:

    Two likelihood functions for θ (from the same or different experiments) contain the same information about θ if they are proportional to one another ... [where] θ represents only the unknown aspect of the probability distribution of X. ... A second qualification for the LP is that it only applies for a fully specified model {fθ}. If there is uncertainty in the model, and if one desires to gain information about which model is correct, that uncertainty must be incorporated into the definition of θ. ... A third qualification is that, in applying the LP to two different experiments, it is imperative that θ be the same unknown quantity in each. [Berger and Wolpert, 1988, pp.19–21]

The LP has perhaps always been obvious to at least some Bayesians. Explicit statements by influential Bayesians have included the following:

    The prior probability of the hypothesis has nothing to do with the observations immediately under discussion, though it may depend on previous observations. Consequently the whole of the information contained in the observation that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give to the likelihood. [Jeffreys, 1961, p.57]

    If two sets of data, x and y, have the following properties: (i) their distributions depend on the same set of parameters; (ii) the likelihoods of these parameters for the two sets are the same; (iii) the prior densities of the parameters are the same for the two sets; then any statement made about the parameters using x should be the same as those made using y. The principle is immediate from Bayes's Theorem because the posterior distributions from the two sets will be equal. [Lindley, 1965, p.59]

    Two possible experimental outcomes D and D′ — not necessarily of the same experiment — can have the same (potential) bearing on your opinion about a partition of events Hi, that is, P(Hi|D) can equal P(Hi|D′) for each i. Just when are D and D′ thus evidentially equivalent, or of the same import? ... P(D′|Hi) = kP(D|Hi). ... the likelihood principle: Two (potential) data D and D′ are of the same import if [this equation] obtains. [Edwards et al., 1963, p.237]

^13 Specifically, the weak sufficiency principle and the weak conditionality principle, defined above.
The last of these is important because it makes explicit that the data need not come from the same experiment, and hence that the likelihood principle is not merely the sufficiency principle in disguise. As we will see below, some Bayesians have questioned whether the LP really does follow from Bayesianism. Finally, some authors have distinguished between weak and strong versions of the LP, where the weak version applies to a single experiment and the strong version compares data from two experiments, as follows:

By the term 'statistical data' we mean . . . a pair (E, x) where E is a well-defined statistical experiment and x the sample generated by a performance of the experiment. . . . To begin with, let us agree to the use of the notation Inf(E, x) only as a pseudo-mathematical short hand for the ungainly expression: 'the whole of the relevant information about [the world] contained in the data (E, x)'. ... (The weak likelihood principle): Inf(E, x′) = Inf(E, x′′) if the two sample points x′ and x′′ generate equivalent likelihood functions ... (The likelihood principle): If the data (E1, x1) and (E2, x2) generate equivalent likelihood functions on Ω, then Inf(E1, x1) = Inf(E2, x2). [Basu, 1975, pp.1–11]

As discussed above, Basu's "weak likelihood principle" is really just a sufficiency principle. Basu has also given a nice paraphrase of the LP without the arguably ambiguous word "information":

We are debating about the basic statistical question of how a given data d = (E, [xobs]), where E = (X, Ω, p) is the model and [xobs] is the sample, ought to be analysed. . . . the likelihood principle . . . asserts that if our intention is not to question the validity of the model E but to make relative (to the model) judgements about some parameters in the model, then we should not pay attention to any characteristics of the data other than the likelihood function generated by it. [Basu, 1975, p.62]

Other versions of the LP can be found in [Savage and discussants, 1962, p.17], [Edwards, 1972, p.30], [Savage, 1976, p.474], [Berger, 1980, p.25], [Lindley, 1982, p.432], [Good, 1983, pp.35–36], [Good, 1983, p.132], [Hill, 1987], [Berry, 1987, p.118], [Berliner, 1987], [Berger, 1993], [Stuart et al., 1999, p.438], [Barnett, 1999, p.188], [Casella and Berger, 2002, p.291] and [Royall, 2004, p.126].
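To see the strong version in action, consider a standard illustration from the LP literature (the specific numbers here are mine, chosen for convenience, not taken from this chapter): a binomial experiment of 12 Bernoulli trials yielding 9 successes, and a negative binomial experiment that samples until 3 failures occur and happens to yield 9 successes. Their likelihood functions for the success probability θ are proportional, so the LP says the two outcomes have the same evidential import. A minimal sketch in Python:

```python
import numpy as np
from scipy.stats import binom, nbinom

# Likelihoods for theta on a grid. E1: 12 trials, 9 successes.
# E2: sample until 3 failures occur, 9 successes observed.
thetas = np.linspace(0.01, 0.99, 99)

lik_e1 = binom.pmf(9, 12, thetas)        # C(12,9) theta^9 (1-theta)^3
# scipy's nbinom counts "failures" before the n-th "success"; taking its
# success to be our failure (probability 1 - theta) models our experiment.
lik_e2 = nbinom.pmf(9, 3, 1 - thetas)    # C(11,9) theta^9 (1-theta)^3

ratio = lik_e1 / lik_e2                  # constant in theta
print(np.allclose(ratio, ratio[0]))      # True: the likelihoods are proportional
```

The constant ratio (here C(12,9)/C(11,9) = 4) is exactly the kind of proportionality factor that the LP declares evidentially irrelevant.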
7.1 The law of likelihood ≠ the likelihood principle
It is vital to distinguish between two principles with confusingly similar names. One is the likelihood principle. The other is the law of likelihood, which says something superficially similar but actually very much more ambitious. Here is the law of likelihood:

Law of likelihood: If hypothesis A implies that the probability that a random variable X takes the value x is pA(x), while hypothesis B implies that the probability is pB(x), then the observation X = x is evidence supporting A over B if and only if pA(x) > pB(x), and the likelihood ratio, pA(x)/pB(x), measures the strength of that evidence. [Royall, 1997, p.3; Hacking, 1965]

If the defence of the law of likelihood in [Hacking, 1965; Edwards, 1972; Royall, 1997] is successful then its success is inherited by the likelihood principle, because the law of likelihood entails the likelihood principle (provided the two principles are stated with the conditions of applicability spelled out above: no relevant utility functions and noninformative choice of observations). But if the law of likelihood falls, that does not necessarily reflect badly on the likelihood principle, because the likelihood principle is much weaker.

The important difference between the two principles is that the law of likelihood talks about an observation supporting one hypothesis to a greater extent than another, while the likelihood principle makes no mention of the extent to which an observation supports a hypothesis. This may seem an unproblematic difference, or even no difference at all, since the likelihood principle talks about the conditions under which an observation supports two hypotheses equally. Moreover, both principles imply that the following statement describes a function Ev which in some sense tells us the evidential support that xobs provides for h1 and h2:

If p(xobs|h1) = p(xobs|h2), then Ev(h1|xobs) = Ev(h2|xobs).

For proponents of the law of likelihood, this statement is straightforwardly true, and Ev is a number. For proponents of the likelihood principle it is also true. But they need not hold that Ev is a number. If they are Bayesian, for example, they hold that Ev is a function (the likelihood function) of arbitrarily high dimension. Whatever Ev is, presumably it can be reduced to a number for some purposes (e.g. in betting scenarios), although this may mean throwing away information. The law of likelihood asserts that it can be measured by a number under all circumstances. The likelihood principle does not. That is why the likelihood principle is much weaker than the law of likelihood.

The likelihood principle is also not to be confused with the method of maximum likelihood, which was probably invented by Gauss [Fisher, 1930a, p.531] and was popularised by Fisher [1921]. This method starts by considering the likelihood function p(xobs|h) with xobs fixed (at whatever was actually observed) and h variable. h is then estimated by maximising p(xobs|h). Although it is by no means
the same as the LP, the method of maximum likelihood uses only the likelihood function and is therefore compatible with the LP.
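As a small illustration of the method (the data and the normal model here are hypothetical, chosen only to make the sketch self-contained), one fixes xobs and maximises p(xobs|h) over h:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical observations from a normal model with unknown mean h
# and known unit variance.
x_obs = np.array([1.2, 0.7, 1.9, 1.4])

def neg_log_likelihood(h):
    # -log p(x_obs | h), up to an additive constant, with x_obs held fixed
    return 0.5 * np.sum((x_obs - h) ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x, x_obs.mean())   # the maximiser coincides with the sample mean
```

Since the method inspects nothing about the experiment beyond the likelihood function p(xobs|h), any two experiments with proportional likelihood functions yield the same maximum likelihood estimate.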
7.2 The likelihood principle for infinite hypothesis spaces
The proofs above do not go through for arbitrary probability density functions, because of ambiguities in the notion of sufficiency [Basu, 1975; Evans et al., 1986; Berger and Wolpert, 1988, pp.28–30]. But they do go through (with very minor modifications) for continuous functions. In addition, the following theorem of Berger and Wolpert can be proved from measure-theoretically more sophisticated versions of the weak sufficiency principle and weak conditionality principle. The theorem shows that the likelihood principle applies to some (in a sense, most) non-continuous infinite hypothesis spaces.

Let φ : U1 → U2 be a Borel bimeasurable one-to-one mapping from U1 ⊂ ℵ1 onto U2 ⊂ ℵ2, and suppose there exists a strictly positive function c on U1 such that for all θ ∈ Θ,

pθ(A) = ∫_{φ⁻¹(A)} (1/c(x1)) pθ(dx1),   A ⊂ U2.

Then an inference can only be drawn from the observation x if it can also be drawn from the observation φ(x), for all x except for a set of probability zero (regardless of the value of θ). If it is agreed to ignore the possibility of events of probability zero then inferences about Θ may depend on ℵ1 and ℵ2 only via x and φ(x). (Adapted from [Berger and Wolpert, 1988, pp.33–34])
When combined with my version of the LP, this says that the likelihood principle is true for any finite set of hypotheses and for any parametric infinite set of hypotheses and for many non-parametric infinite sets of hypotheses.
7.3 Bjørnstad's generalisation: continuation of the likelihood principle
Jan F. Bjørnstad has proved a version of the likelihood principle which applies even when the hypotheses to be examined depend on the observed data. This is a case which is excluded by my general framework, but I will briefly state Bjørnstad’s theorem because it holds the prize as the most general version of the likelihood principle to have been proved to date. Define M as (X , h, p) as previously, but this time let the quantity about which we wish to draw inferences be not h but λ, and let λ be a function of x, thus:
λ = λ(y, ψ),

where ψ represents the unknown quantities which are being treated as variables. Let θ represent the unknown quantities which are being treated as parameters. Let h = (ψ, θ). Then inferences from x ∈ X to λ must depend on x only via the ordered pair (λ, p(x|λ, θ)). The first term in this pair is new. The second term is a likelihood function, but not the same likelihood function as in the simpler case proved above (in which it was p(x|h)). See [Bjørnstad, 1996] for a proof of this principle. In real scientific cases λ(x) is often independent of x, and in this common case Bjørnstad's likelihood principle reduces to the simpler likelihood principle proved above.

8 IS THE LIKELIHOOD FUNCTION WELL DEFINED?
A number of objections have been raised to the LP. There is not enough space here to reply to them all; in any case, almost all of them stand or fall with the proof of the LP. See [Basu, 1988], [Berger and Wolpert, 1988] and references cited in the latter for discussions of some of these. In the space available, I would like to discuss a philosophically interesting objection to the LP which does not stand or fall with the proof, because it objects to an aspect of the way in which I have formulated the problem. Some Bayesians have argued that Bayesianism is effectively incompatible with the likelihood principle, on the grounds that there is no such thing as an isolated likelihood function [Bayarri et al., 1987]. They argue that in a Bayesian analysis there is no principled distinction between the likelihood function and the prior probability function. If that is so, then we will not be able to draw tables like tables 1 to 3, nor apply the definition of the LP, even though the main part of the definition does not explicitly mention the likelihood function. This objection is motivated, for these Bayesians, by the fact that (they say) we need prior probabilities in order to apply the LP. Once we have admitted the universal necessity of using prior probabilities we will no longer need to separate the likelihood function from the prior [Bayarri et al., 1987; Berger and Wolpert, 1988]. Thus, they accept proofs of the likelihood principle, conditional on the assumption that a likelihood function has been specified; but they deny that specifying a likelihood function is necessary, and they deny that it is possible to do so in a principled way. They believe that the likelihood principle is true, if stated carefully, but not straightforwardly applicable. Despite disputing the applicability of the likelihood principle in this way, Bayesians in this school often see it as a useful weapon with which to combat frequentism. I like to think of this view as Bayesian Hegelianism, as it sees the likelihood principle as an important part of a historical dialectic which will inevitably lead to a
synthesis in which it is no longer required. Such a prediction has been beautifully summarised by Bayarri, DeGroot and Kadane, following a metaphor proposed by Butler [1987]:

The [frequentist] Cheshire Cat vanished quite slowly, first the tail and then the body of frequentist methods. The last visible part was the likelihood [principle] grin, "which remained some time after the rest of it had gone". But that, too, disappeared. [Bayarri et al., 1987, p.27]

To return to the objection itself: the claim is that there is no principled definition of the likelihood function because there is no principled way of deciding what should be labelled x (data) and what should be labelled h (hypothesis) in the definition of the likelihood as p(x|h). Bayarri, DeGroot and Kadane's examples all involve the following set-up.14

14 Throughout this section I replace Bayarri, DeGroot and Kadane's y by xobs, x by y, θ by ψ, and f by p, in order to remain consistent with the terminology of this chapter.

Suppose that the random variable Y is not observed but another random variable X is observed with conditional density p(xobs|y, ψ). [Then] it is irrelevant which of the factors on the right-hand side [of

p(ψ, y|xobs) = p(xobs|y) p(y|ψ) p(ψ) / ∫ p(xobs, y, ψ)

] are regarded as part of the [likelihood function] and which are regarded as part of the prior distribution. [Bayarri et al., 1987, pp.6–7]

This is correct, from a Bayesian point of view. (See also [Hill, 1987, p.99].) In contrast, we do have to distinguish the likelihood function in order to apply the likelihood principle. Three natural choices are p(xobs|ψ), p(xobs, y|ψ) and p(xobs|y, ψ) [≡ p(xobs|y)], but there is no natural way to choose between these three possibilities . . . or so Bayarri, DeGroot and Kadane claim. The problem for the likelihood principle, as thus stated, is very easily solved. One need merely specify what one means by "likelihood function". I have already done this: for me, the likelihood function is always p(xobs|y, ψ), where xobs is what is observed. As Berliner [1987] correctly notes, the likelihood principle "applies equally well, though separately, in each of the potential cases [which Bayarri, DeGroot and Kadane] enumerate", so my solution is perfectly adequate, as would be any other solution which serves to disambiguate the term "likelihood function". However, it may appear that a problem remains, since others may disambiguate the likelihood function differently from me. For example, Bayarri, DeGroot and Kadane imagine a case in which two doxastic agents see the same observation, and analyse it using the same mathematical model except that one of them introduces an unobserved variable y into the model while the other does not. This leads the
two agents to define the likelihood function in different ways, following which they cannot use the likelihood principle to compare their results. To see that my version of the likelihood principle still applies, we have only to note that these two agents are using different hypothesis spaces H: for one of them, H includes a specification of an unobserved variable, while for the other it does not. Given a fixed H (which is an explicit precondition of my version of the likelihood principle), only one likelihood function is possible, namely p(xobs|h ∈ H). (Note that they agree on xobs; otherwise no joint analysis of any sort would be possible.) A practical problem remains if neither of the two agents accepts the other's parameterisation of the hypothesis space, but there is no reason why this should happen, since the two parameterisations essentially agree with each other (more precisely, one parameterisation is easily reducible to the other by taking a marginal distribution with respect to y).15

15 See also a similar solution to the problem in [Berliner, 1987, p.19] and alternative solutions in [1987] and [Berger and Wolpert, 1988, p.39].

ACKNOWLEDGEMENTS

Thanks to Prasanta Bandyopadhyay, James O. Berger, Alan Hájek, Claire Leslie, Alison Moore, Max Parmar, Daniel Steel and David Spiegelhalter for discussions which contributed to this chapter, and to an anonymous referee for many helpful suggestions. However, the views given here do not necessarily represent the views of these people, and any errors are my own.

BIBLIOGRAPHY

[Aldrich, 2000] John Aldrich. Fisher's "Inverse Probability" of 1930. International Statistical Review, 68(2):155–172, 2000.
[Barnard, 1947] G. A. Barnard. Review of Wald's 'Sequential Analysis'. Journal of the American Statistical Association, 42:658–669, 1947.
[Barnard, 1985] George Barnard. Discussion of 'In defense of the likelihood principle: axiomatics and coherency'. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics, volume 2, pages 57–60. Elsevier, 1985.
[Barnett, 1999] Vic Barnett. Comparative Statistical Inference. John Wiley, New York, 3rd edition, 1999.
[Basu, 1975] Debabrata Basu. Statistical information and likelihood (with discussion). Sankhyā, Series A, 37:1–71, 1975.
[Basu, 1988] Debabrata Basu. Statistical Information and Likelihood: A Collection of Critical Essays, volume 45 of Lecture Notes in Statistics. Springer-Verlag, New York, 1988.
[Bayarri and Berger, 2004] M. J. Bayarri and J. O. Berger. The interplay of Bayesian and frequentist statistics. Statistical Science, 19(1):58–80, 2004.
[Bayarri et al., 1987] M. J. Bayarri, M. H. DeGroot, and J. B. Kadane. What is the likelihood function? (with discussion). In Shanti S. Gupta and James O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 3–27. Springer-Verlag, New York, 1987.
[Berger and Wolpert, 1984] James O. Berger and Robert L. Wolpert. The Likelihood Principle. Institute of Mathematical Statistics, Hayward, California, 1st edition, 1984.
[Berger and Wolpert, 1988] James O. Berger and Robert L. Wolpert. The Likelihood Principle. Institute of Mathematical Statistics, Hayward, California, 2nd edition, 1988.
[Berger et al., 1994] James Berger, Lawrence D. Brown, and Robert Wolpert. A unified conditional frequentist and Bayesian test for fixed and sequential simple hypothesis testing. The Annals of Statistics, 22:1787–1807, 1994.
[Berger, 1980] James O. Berger. Statistical Decision Theory: Foundations, Concepts, and Methods. Springer-Verlag, New York, 1980.
[Berger, 1993] James Berger. An overview of robust Bayesian analysis. Technical Report 93-53C, Purdue University, 1993.
[Berger, 2006] James Berger. The case for objective Bayesian analysis. Bayesian Analysis, 1(3):385–402, 2006.
[Berliner, 1987] Mark Berliner. Discussion of 'What is the likelihood function?'. In Shanti S. Gupta and James O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 17–20. Springer-Verlag, New York, 1987.
[Berry, 1987] D. A. Berry. Interim analysis in clinical trials: the role of the likelihood principle. The American Statistician, 41:117–122, 1987.
[Birnbaum, 1962] Allan Birnbaum. On the foundations of statistical inference. Journal of the American Statistical Association, 57(298):269–306, June 1962.
[Bjørnstad, 1996] Jan F. Bjørnstad. On the generalization of the likelihood function and the likelihood principle. Journal of the American Statistical Association, 91(434):791–806, June 1996.
[Butler, 1987] R. W. Butler. A likely answer to 'What is the likelihood function?'. In Shanti S. Gupta and James O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 21–26. Springer-Verlag, New York, 1987.
[Casella and Berger, 1987] George Casella and Roger L. Berger. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association, 82:106–111, 1987.
[Casella and Berger, 2002] George Casella and Roger L. Berger. Statistical Inference. Duxbury, Pacific Grove, 2nd edition, 2002.
[Cox, 1958] D. R. Cox. Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29(2):357–372, June 1958.
[Dawid, 1977] A. P. Dawid. Conformity of inference patterns. In J. R. Barra et al., editors, Recent Developments in Statistics, pages 245–256. North-Holland, Amsterdam, 1977.
[Edwards et al., 1963] W. Edwards, H. Lindman, and L. J. Savage. Bayesian statistical inference for psychological research. Psychological Review, 70:193–242, 1963.
[Edwards, 1972] A. W. F. Edwards. Likelihood. Cambridge University Press, London, 1972.
[Evans et al., 1986] M. Evans, D. A. S. Fraser, and G. Monette. On principles and arguments to likelihood. Canadian Journal of Statistics, 14:181–199, 1986.
[Fisher, 1921] R. A. Fisher. On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1:3–32, 1921.
[Fisher, 1930a] R. A. Fisher. Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535, 1930.
[Fisher, 1930b] R. A. Fisher. Inverse probability (abstract). In Report of the Meeting of the British Association for the Advancement of Science, page 302, Bristol, 1930. British Association for the Advancement of Science.
[Forster, 2006] Malcolm R. Forster. Counterexamples to a likelihood theory of evidence. Minds and Machines, 16(3):319–338, 2006.
[Gelman et al., 1995] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995.
[Good, 1981] I. J. Good. Some logic and history of hypothesis testing. In Joseph C. Pitt, editor, Philosophical Foundations of Economics, University of Western Ontario Series on the Philosophy of Science, pages 149–174. D. Reidel, Dordrecht, 1981.
[Good, 1983] I. J. Good. Good Thinking. University of Minnesota Press, Minneapolis, 1983.
[Hacking, 1965] Ian Hacking. Logic of Statistical Inference. Cambridge University Press, London, 1965.
[Hill, 1987] Bruce M. Hill. The validity of the likelihood principle. American Statistician, 41(2):95–100, 1987.
[Jeffreys, 1961] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, 3rd edition, 1961.
[Kadane et al., 1999] Joseph B. Kadane, Mark J. Schervish, and Teddy Seidenfeld. Rethinking the Foundations of Statistics. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge University Press, Cambridge, 1999.
[Leslie, 2008] Claire Leslie. Exhaustive Conditional Inference: Improving the Evidential Value of a Statistical Test by Identifying the Most Relevant P-Value and Error Probabilities. PhD thesis, University of Melbourne, Melbourne, 2008.
[Lindley, 1965] D. V. Lindley. Introduction to Probability and Statistics from a Bayesian Viewpoint. Cambridge University Press, Cambridge, 1965.
[Lindley, 1982] Dennis V. Lindley. The role of randomization in inference. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association, Volume Two: Symposia and Invited Papers, pages 431–446, 1982.
[Lindley, 1983] D. V. Lindley. Response to 'Parametric empirical Bayes inference' by Morris. Journal of the American Statistical Association, page 381?, 1983.
[Pratt, 1961] J. W. Pratt. Review of Lehmann's 'Testing Statistical Hypotheses'. Journal of the American Statistical Association, 56:163–166, 1961.
[Pratt, 1962] J. W. Pratt. Discussion of 'On the foundations of statistical inference' by Allan Birnbaum. Journal of the American Statistical Association, 57:314–315, 1962.
[Robbins, 1955] H. Robbins. An empirical Bayes approach to statistics. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 157–164. University of California Press, Berkeley, 1955.
[Royall, 1997] Richard Royall. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, London, 1997.
[Royall, 2004] Richard Royall. The likelihood paradigm for statistical evidence. In Mark L. Taper and Subhash R. Lele, editors, The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations, pages 119–138. University of Chicago Press, Chicago and London, 2004.
[Savage and discussants, 1962] L. J. Savage and discussants. The Foundations of Statistical Inference. Methuen, London, 1962.
[Savage, 1976] L. J. Savage. On rereading R. A. Fisher (with discussion). Annals of Statistics, 42:441–500, 1976.
[Stuart et al., 1999] Alan Stuart, J. Keith Ord, and Steven Arnold. Kendall's Advanced Theory of Statistics Vol. 2A: Classical Inference and the Linear Model. Arnold, London, 6th edition, 1999.
[Welsh, 1996] A. H. Welsh. Aspects of Statistical Inference. John Wiley, New York, 1996.
Part V
Recent Advances in Model Selection
AIC, BIC AND RECENT ADVANCES IN MODEL SELECTION

Arijit Chakrabarti and Jayanta K. Ghosh
OVERVIEW

As explained in, e.g., [Ghosh and Samanta, 2001], model selection has somewhat different connotations in Statistics and History or Philosophy of Science. In the latter it has come to mean a major shift in paradigm on the basis of available data, of which one of the most famous examples is the shift from Newtonian Physics to Einstein's relativistic Physics on the basis of data obtained in a famous expedition of Eddington (Example 1). In Statistics it has a useful but much more pedestrian role of distinguishing between two statistical models on the basis of available data. For example, is the data coming from a normal distribution or a Cauchy distribution (see Example 5)? One of the most popular applications of Classical Statistical Model Selection is to determine which variables are important in a regression model (equivalently a linear model) for the dependent response variable y, in terms of the auxiliary variables x in the model (Example 2). However, the Classical Statistical Model Selection Rules can also be used for problems of paradigm shift (Example 1). We use the word "Classical Statistics" to distinguish it from "Bayesian Statistics". A standard introduction to Classical Statistics is [Lehmann and Casella, 2001]. A definitive overview of Bayesian Statistics is provided in [Bernardo and Smith, 1994].

The interpretation as well as motivation of Statistical Model Selection Rules depend on whether one believes in the Classical or Bayesian paradigm of Statistics, as well as on the loss (or utility) function. Of the two most well-known Statistical Model Selection Rules, namely AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), AIC has a classical origin whereas BIC arises as an approximation to a Bayes rule up to O(1) (the exact meaning of this statement will be explained in Section 3). At this level of approximation, one may ignore the prior distribution of the Bayesian. By recent advances in the title we mean both recent improvements in understanding the properties, performance and relevance of AIC and BIC, as well as new model selection rules and other advances. Our review of the new rules and work related to them will be brief since AIC and BIC are our primary focus.

AIC was proposed by Akaike in a series of papers [Akaike, 1973; 1974]. He seemed to be guided by optimal prediction of a new set of y's corresponding to a
replicate of the observed x's. It was proved first by Shibata (see [Shibata, 1981; 1983]) that in certain important problems, for large samples, AIC predicts better than any other model selection rule. He proves it by showing that asymptotically it predicts as well as an Oracle, where the Oracle is a model selection rule which always selects the best model for prediction. The prediction error of the Oracle gives a lower bound to the error committed in prediction for all model selection rules. The exact form of the Oracle considered by Shibata is given in Section 2. Shibata's ideas were considerably simplified in [Li, 1987]. Interesting general results based on Li's ideas were obtained by [Shao, 1997]. We discuss more about the predictive properties of AIC in Sections 2 and 4.

BIC was introduced by Schwarz [1978] and can be used for approximating the Bayes Factor corresponding to two models; it is discussed in some detail in Section 3. We briefly introduce the Bayes Factor here. Suppose M1 and M2 are two models specifying two families of densities p(x|θ1), θ1 ∈ Θ1 and p(x|θ2), θ2 ∈ Θ2 for the given data, where Θ1 and Θ2 are the two parameter spaces corresponding to the two models. The Bayes Factor BF21 is the ratio of the marginal (probability) density of the data under M2, namely P(data|M2), and that under M1, namely P(data|M1), with the latter in the denominator and the former in the numerator:

BF21 = P(data|M2) / P(data|M1).

If BF21 > 1, one chooses M2. Otherwise one chooses M1. A Bayesian assumes a priori probabilities for the two models under consideration and, using the data, posterior probabilities of the two models, namely P(M1|data) and P(M2|data), are obtained. BF21 is also equal to the ratio of the posterior odds (of model M2 with respect to model M1) and the prior odds. If the prior probabilities of M1 and M2 are taken to be half, BF21 is the same as the posterior odds. The technical definitions of the marginal densities of data and posterior probabilities of models appear in Section 3.

Both AIC and BIC are special cases of penalized likelihood rules, which may be described as follows. Suppose M2 is a model specifying a family of densities p(x|θ), θ ∈ Θ2 for the given data and M1 is a submodel

p(x|θ), θ ∈ Θ1 ⊂ Θ2.   (1)

This is a case of nested models. (It must be noted that in this case P(M2|data) ≠ P(θ ∈ Θ2|data). We discuss this in Section 3.) The method of maximum likelihood, so popular in Classical Statistics, would suggest evaluating each model Mi by the maximized likelihood, or equivalently the maximized log-likelihood, of the data x under Mi, namely sup_{θ∈Θi} log p(x|θ), where "sup" denotes supremum. Since Θ1 ⊂ Θ2,

sup_{θ∈Θ1} log p(x|θ) ≤ sup_{θ∈Θ2} log p(x|θ),

so if we compare only the maximized log-likelihood we would always choose the more complex model. It is intuitively obvious that this may not be a good thing to do for all data. As pointed out in [Hastie et al., 2003; Forster and Sober, 1994], a very complex model fitting the data too well is ignoring the fact that the data is
composed of both signal (i.e., significant, repeatable aspects) and noise (i.e., random perturbations of the data).
Figure 1. [A scatterplot of Y against X, showing a simple straight-line fit and a zigzag curve passing through all the observed points.]

As in Figure 1 above, a simpler model fitting a simple straight line is likely to be better than the zigzag curve passing through all the observed points. This is where the so-called principle of parsimony or Ockham's Razor comes in. It suggests one should penalize each model according to its complexity. Consider the simple situation when Θ1 = R^{d1} and Θ2 = R^{d2} for some positive integers d1 and d2 such that d1 < d2. In such a case, the usual dimension of the parameter points belonging to each individual model, i.e., d1 or d2, is a simple measure of the complexity of the model. This is so since the larger the dimension, the richer the model, in the sense of having a larger number of independently varying parameter components, and the more complex it is to make inference on them. This discussion leads to the penalized log-likelihood

sup_{θ∈Θi} log p(x|θ) − λ di,

where di is the dimension of Θi and λ is a positive constant specifying the penalty per unit dimension. If λ = 1, we get AIC. If λ = (log n)/2, we get BIC. We choose the model which maximizes the penalized log-likelihood. For the rest of the paper, we will use the notations p or pi to denote model dimensions.

For a long time it has been unclear which of AIC and BIC or some other penalized log-likelihood criterion is appropriate. Until very recently, there has not been much effort to understand the penalized likelihood methods in a unified manner, which would provide formal theoretical arguments in favor of each type of penalty for certain specific purposes and tell how and where each penalty comes from a particular mathematical principle. The situation is somewhat clearer now
(see Section 4) with respect to AIC and BIC. Rapid progress in understanding is taking place for other rules also (see Section 5). This understanding in the case of AIC is similar to that of [Forster and Sober, 1994] and we also don’t believe that AIC is a solution to all model selection problems. It is worth mentioning in this context that in a conference held in Oberwolfach in 2005 (which is also mentioned in Section 5), serious research aimed at a unified understanding of penalized likelihood methods were presented. In Section 4 we suggest AIC and BIC are appropriate in somewhat different problems, the difference is due to different purposes and different loss functions. Section 5 contains a brief review of some recent advances.
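A small sketch may help fix ideas before the examples (the simulated data, the known-variance normal model, and the two-model comparison below are our own illustrative choices, not the chapter's): with penalty λ = 1 we get AIC, and with λ = (log n)/2 we get BIC.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # data; sigma = 1 treated as known
n = len(x)

def max_log_lik(mu_hat):
    return norm.logpdf(x, loc=mu_hat, scale=1.0).sum()

# M1: mu = 0 (dimension 0); M2: mu arbitrary (dimension 1, MLE = sample mean)
loglik = {"M1": max_log_lik(0.0), "M2": max_log_lik(x.mean())}
dim = {"M1": 0, "M2": 1}

for lam, name in [(1.0, "AIC"), (np.log(n) / 2, "BIC")]:
    scores = {m: loglik[m] - lam * dim[m] for m in loglik}
    print(name, "chooses", max(scores, key=scores.get))
```

BIC's heavier penalty ((log n)/2 exceeds 1 once n ≥ 8) is what makes it more reluctant than AIC to move to the larger model.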
1 EXAMPLES
In this section we present several examples where model selection techniques can be applied to answer scientific or statistical questions. Examples 1 through 4 appeared in [Ghosh and Samanta, 2001], with appropriate references to sources of the data.

EXAMPLE 1 Eddington's experiment. According to Einstein's theory of gravitation, light gets deflected by gravitation and the amount of such deflection can also be specified. More specifically, Einstein famously predicted that under the gravitational attraction of the Sun, the light emanating from nearby stars will get deflected, but such an effect would only be visible during a total solar eclipse (when such deflection can be measured through apparent change in a star's position). To verify this prediction, a famous experiment was conducted by a team led by British astrophysicist Eddington immediately after the first world war. Four observations were collected on the amount of angular deflection (measured in seconds) by Eddington's team and other groups (spread over a period of 10 years) and those turned out to be X1 = 1.98, X2 = 1.61, X3 = 1.18 and X4 = 2.24. Einstein predicted that the amount of deflection would be 1.75. Suppose we assume that the Xi's are independently normally distributed about the unknown mean µ, i.e., Xi ∼ N(µ, σ²), where σ² is unknown. To test statistically whether Einstein's conjecture was true based on the observed data, we can consider choosing between the two models M1 : µ = 1.75 and M2 : µ ≠ 1.75 using some model selection techniques. It is also possible to formulate this as a problem of choosing one of two nested models, in which case M2 would permit µ to have all possible real values. Since that leads to some subtle logical questions, we postpone discussion of nested models for Section 3. (However, Examples 3 and 4 in this section are formulated as nested models.) Even though σ² is unknown here, just to illustrate how to use these two methods, let us treat σ² as known and equal to the sample variance s². Then, making the transformation X′ = (X − 1.75)/s of the original data X, the model selection problem becomes the same as choosing between the two models M1 : µ = 0 and M2 : µ is a non-zero real number, where µ now denotes the mean of the transformed observations. BIC is the appropriate model selection rule here since we
want to select the true model. The values of the BIC criterion for M1 and M2 are −5.675 and −6.37 respectively, and so BIC selects M1. Although AIC is not really appropriate for this purpose, it is a curious fact that AIC also selects the model M1 in this problem. So, according to both these criteria, Einstein's prediction of µ = 1.75 is supported by the observed data, although the evidence is not very strong. This particular data is now only of historical importance. Much stronger confirmation of Einstein's theory has come from other experiments; see [Gardner, 1997].

EXAMPLE 2 Hald's regression data. Table 1 below presents data on heat evolved during the hardening of Portland cement and four variables (all or a subset of which) may be able to explain the amount of heat evolved. These four variables are called the explanatory variables, denoted by x1, x2, x3 and x4, which measure respectively the percentage weight of four chemical compounds (which together constitute cement), and the response variable is denoted by y, which measures the total calories given off during hardening per gram of cement after 180 days. This data set has been analyzed several times before, e.g., [Berger and Pericchi, 1995; Burnham and Anderson, 2003].

Table 1. Hald's cement hardening data

x1  x2  x3  x4    y
 7  26   6  60   78.6
 1  29  15  52   74.3
11  56   8  20  104.3
11  31   8  47   87.6
 7  52   6  33   95.9
11  55   9  22  109.2
 3  71  17   6  102.7
 1  31  22  44   72.5
 2  54  18  22   93.1
21  47   4  26  115.9
 1  40  23  34   83.8
11  66   9  12  113.3
10  68   8  12  109.4

One traditional approach in statistics to deal with such data sets is to represent y in a normal linear regression model as

yi = β0 + β1 x1i + · · · + βk xki + εi, i = 1, . . . , n,

where x1, . . . , xk generically denote the regressors, which according to this particular model explain the repeatable aspect of the variation in the y values, and the εi are independent and identically distributed as N(0, σ²). Here the βi's are the unknown parameters. The different models in such situations correspond to the different
possible choices of regressor variables from the pool of all potential regressors. For example, for the data in Table 1, one can consider a total of 2⁴ − 1 = 15 possible models with the following choices of regressors: {x1}, . . . , {x4}, {x1, x2}, . . . , {x3, x4}, {x1, x2, x3}, . . . , {x2, x3, x4} and {x1, x2, x3, x4}. The purpose of choosing a model or a set of models here would be to pinpoint which regressors x seem to have the most significant causal relationship with the amount of heat evolved y. Here AIC chooses the model with regressors x1, x2 and x4, while BIC chooses the model with regressors x1 and x2. The AIC and BIC values for each model are reported in [Ghosh and Samanta, 2001], expressed as a difference from the value for the model which is selected by that criterion. Thus the AIC value for model {x1, x2} is −0.225, while that for model {x1, x2, x4} is 0. On the other hand, the BIC values for these two models are 0 and −1.365 respectively. So AIC favors {x1, x2, x4} while BIC favors {x1, x2}.

EXAMPLE 3 Nested model selection problem and hypothesis testing. Suppose X1, . . . , Xn are independent but identically normally distributed with mean µ and variance 1. Suppose the question to a statistician is whether the mean µ is 0 or not. The statistician formulates this problem as the testing problem with the null hypothesis H0 : µ = 0 versus the alternative hypothesis H1 : µ ≠ 0. Note that H0 and H1 specify two disjoint subsets of R for the unknown µ. The same problem can also be formulated as one of selecting between two models M0 : µ = 0 and M1 : µ ∈ R. Note that here M0 is nested within M1. If the data supports µ = 0, then it is consistent with both models, but one will choose the simpler model on grounds of parsimony, which requires that of all models which explain the data equally well, one should choose the simplest model. As explained earlier in the Overview, simplicity is defined in terms of the dimension of the model. A more detailed discussion of nested models appears in Section 4.

EXAMPLE 4 ANOVA-type problems and a high dimensional setup. Suppose one has p similar normal populations, each population i having a potentially different mean µi, i = 1, . . . , p. We have r observations from each population and let n = rp. One might be interested in knowing whether these p means are the same or not. This question can be thought of as a question of choice between the two models M1 : µ1 = · · · = µp vs M2 : the µi's are arbitrary. This is again a nested model problem. This is called the one-way ANOVA (Analysis of Variance) problem. Stone [1979] considered the situation where r is fixed but p → ∞ and hence n = pr also tends to infinity, and critically studied the performance of AIC and BIC. Stone showed that AIC performs better than BIC in identifying the true model, in the sense that BIC chooses M1 with probability tending to 1 under M2 if the µ's satisfy certain conditions and p grows sufficiently fast, while AIC chooses the correct model M2 with probability tending to 1. Data sets for which p = dimension = number of parameters is large are called high dimensional. High dimensional data sets appear often these days in many applications. A
prime example of this is when one wants to test gene expression levels simultaneously for thousands of genes (p in this formulation) but has a rather small set of data points (r in this formulation) on each gene, each data point corresponding to an individual. The level of expression of a gene, for example, could be related to its effect on some tumor or character of the individual. It has great medical significance.

EXAMPLE 5 Normal vs. Cauchy. Suppose the following observations (when placed in increasing order of magnitude) are obtained in an experiment and the question is which distribution the data come from: {−6.56759, −3.14456, −1.19043, −0.64666, −0.64624, −0.54472, −0.43171, −0.34207, −0.32573, −0.31834, −0.29348, −0.19512, −0.14658, −0.12093, −0.0328, 0.075277, 0.131894, 0.187061, 0.199137, 0.214316, 0.226209, 0.471883, 0.654485, 0.719648, 0.788128, 0.911007, 0.946036, 1.28061, 1.676115, 7.986715}. Looking at the data, it seems very likely that the distribution from which the data set is generated is symmetric around zero. We are not sure what the form of the distribution is, but suspect that it might either be a normal or a Cauchy with an unknown scale parameter. The normal and Cauchy distributions with the same location are often not easy to distinguish in moderate sample sizes, but in this case we have two observations in the data set which seem too far away from the other observations, which is more likely to happen in the case of a Cauchy distribution, because its tails are very thick. But in order to decide more objectively, we want to check statistically whether M1: the unknown distribution is a Cauchy distribution with location 0, or M2: the unknown distribution is a normal distribution with location 0, is true. We use BIC for this purpose. It chooses model M1, the difference in BIC criterion value for the two models being 51.7069. In fact BIC chooses the correct model here since the data were simulated from a Cauchy distribution.
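A sketch of the computation behind Example 5 (the code is our own; it uses the chapter's convention BIC = maximised log-likelihood − (d/2) log n, so the exact number it prints may differ from 51.7069 if the published value was computed on the conventional −2·log scale or with a different optimiser):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import cauchy, norm

# The 30 observations of Example 5
x = np.array([-6.56759, -3.14456, -1.19043, -0.64666, -0.64624, -0.54472,
              -0.43171, -0.34207, -0.32573, -0.31834, -0.29348, -0.19512,
              -0.14658, -0.12093, -0.0328, 0.075277, 0.131894, 0.187061,
              0.199137, 0.214316, 0.226209, 0.471883, 0.654485, 0.719648,
              0.788128, 0.911007, 0.946036, 1.28061, 1.676115, 7.986715])
n = len(x)

def max_log_lik(dist):
    # profile out the unknown scale parameter by numerical maximisation
    res = minimize_scalar(lambda s: -dist.logpdf(x, loc=0, scale=s).sum(),
                          bounds=(1e-3, 10.0), method="bounded")
    return -res.fun

# Each model has one free parameter (the scale), so the BIC penalties cancel
bic_cauchy = max_log_lik(cauchy) - 0.5 * np.log(n)   # M1
bic_normal = max_log_lik(norm) - 0.5 * np.log(n)     # M2
print("BIC(M1) - BIC(M2) =", bic_cauchy - bic_normal)  # positive: choose M1
```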
2 THE AKAIKE INFORMATION CRITERION (AIC)
In this section we consider the Akaike Information Criterion (AIC) in a few canonical statistical problems and state results of its statistical optimality therein. We also discuss its connection with other model selection criteria and some of the generalizations of it. The optimality is connected with Akaike’s original motivation as brought out in [Forster and Sober, 1994] but it does not follow as an immediate consequence. In fact the proofs are quite non-trivial. We start by introducing the linear model. Consider Y = (Y1 , . . . , Yn )′ , a vector of observations of the response (dependent) variables and let X = (X1 , . . . , Xp ) be the (n × p) matrix of explanatory variables, Xj , j = 1, . . . , p being the j-th column of X. This is the same as the set up in Example 2 in Section 1, but now written in matrix notation. In the linear model, as the name suggests, one connects the mean vector µ = E(Y |X) (assuming that X is fixed) with the explanatory variables via the relationship µ = Xβ, where β ∈ Rp is the unknown parameter
of interest. It is further assumed that Y = Xβ + ε, where ε ∼ N(0, σ² I_{n×n}) is the vector of random errors and σ² is its variance, which may be known or unknown. In this context a model M specifies a certain subset of the β vector to be equal to zero while the others are allowed to be arbitrary. We will, for simplicity, assume that the model space is M = {M1, . . . , Mp}, where under model Mj, βk = 0 for k > j while β1, . . . , βj are arbitrary. This is the nested model scenario (as also seen in Examples 3 and 4), since if Mj is true then so are Mj+1, . . . , Mp, for any j ∈ {1, . . . , p}. Model Mj postulates that only the first j explanatory variables are potentially responsible for the variability in the repeatable aspect of the Y's (measured by the mean) while the others do not contribute anything.

Assuming that σ² is known, the Akaike Information Criterion (AIC) for model Mj is

AIC(j) = log L(β̂j) − j,

where β̂j is the maximum likelihood estimator of β under model Mj and L(β̂j) is the joint density of the data under model Mj evaluated at β̂j. AIC chooses the Mj which maximizes AIC(j) among j ∈ {1, 2, . . . , p}. (Under the assumption that X′X = I_{p×p}, one can easily derive this criterion using direct calculation in this setup.) Upon simplification, this criterion can be written equivalently as (ignoring constants depending on n but independent of j)

AIC(j) = ||Y − Xβ̂j||² + 2jσ²,

and one minimizes AIC(j) over j ∈ {1, 2, . . . , p} to choose the best model, according to this criterion. If σ² is unknown, AIC(j) becomes (with σ² estimated by its maximum likelihood estimator under model Mj)

(n/2) log(2π) + n/2 + (n/2) log(||Y − Xβ̂j||²/n) + (j + 1).

An alternative method estimates σ² by either the maximum likelihood estimator of σ² under the largest model or some estimator which is consistent under all models. It can be shown that the difference between the AIC for unknown σ² and this form of AIC with a plug-in estimator of σ² is, for large sample size n, approximately a constant depending on n but independent of j (i.e., the model under consideration), under pretty mild conditions. If, instead of the nested model scenario, one considers M to be any collection of models (each of which specifies a certain subset of the coordinates of β to be zero), the definition of AIC(M) (up to constants independent of models) for a generic M ∈ M with number of free parameters pM becomes

AIC(M) = ||Y − Xβ̂M||² + 2 pM σ²   if σ² is known, and

AIC(M) = log(||Y − Xβ̂M||²/n) + 2 pM/n   if σ² is unknown.
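To make the known-σ² criterion concrete, here is a small simulation (entirely our own construction: the design matrix, the true β and the noise are made up for illustration) that computes AIC(j) = ||Y − Xβ̂j||² + 2jσ² for the nested models and picks the minimiser:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 100, 5, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])    # only first 3 coordinates active
Y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

def aic(j):
    Xj = X[:, :j]                                   # model M_j: first j regressors
    beta_hat, *_ = np.linalg.lstsq(Xj, Y, rcond=None)
    rss = np.sum((Y - Xj @ beta_hat) ** 2)
    return rss + 2 * j * sigma2

scores = {j: aic(j) for j in range(1, p + 1)}
print(min(scores, key=scores.get))                   # typically selects j = 3
```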
One very important problem where AIC can be used as a model selection rule is the problem of nonparametric regression, where the functional form of dependence between the dependent variable and the regressor is not expressible in terms of finitely many unknown parameters, as it is, for example, in the usual polynomial regression problem, where the regression function may be known to be a polynomial of degree five in the regressor, with the six coefficients of the polynomial being the unknown parameters. Instead, the nonparametric regression model states that the expected value of Y given x is some unknown f(x), where f can belong to a pretty large class of functions, e.g., the class of all functions which are square integrable. The reason for assuming that f might belong to a very large class of functions is the general perception that the relationship could be pretty complex. To illustrate our point, we describe a practical example where one does not really have much clue about the relationship between x and Y to start with, and the use of nonparametric regression is much more appealing intuitively than the usual parametric regression. The cosmic microwave background (CMB) radiation data from the Wilkinson Microwave Anisotropy Probe (WMAP) is analyzed in several chapters of [Wasserman, 2006] using different approaches to nonparametric regression. The basic data is a temperature map obtained by the WMAP, showing the temperature in different points of the sky 13 billion years ago. The fluctuations in the temperature map, measured through the strength of temperature fluctuations f(x) (called the power spectrum) at each frequency x (or "multipole moment"), provide information about the early universe. So estimation of f(·) is the most interesting thing to cosmologists. Through an appropriate procedure, the temperature map can be transformed to a scatterplot of estimated power Y versus frequency x, given by (x1, Y1), . . . , (xn, Yn). The goal of nonparametric regression will be to estimate the function f, based on the Y's, with very little assumption on its functional form.

In general, based on observations Yi, i = 1, 2, . . . , n of the dependent variable at regressor values {xi, i = 1, . . . , n}, one writes the nonparametric regression model as

Yi = f(xi) + εi, i = 1, . . . , n,

where f is assumed to belong to a given (large) class of functions and the εi are i.i.d. errors with zero mean and finite variance. If f is square integrable, one can represent the function uniquely as an expansion which is an infinite linear combination of certain sine and cosine functions (called the basis functions), appearing in a specific order. This, in mathematical parlance, is known as the Fourier expansion of the function. Each function is determined by the coefficients (called the Fourier coefficients) that multiply the basis functions in its expansion; i.e., if the Fourier coefficients of two functions are the same, then the two functions must be identical almost everywhere. So estimating f becomes the same as estimating its Fourier coefficients. Since one only has a finite amount of data, infinitely many Fourier coefficients can't be estimated. Noting that as one goes further down the expansion the Fourier coefficients become increasingly negligible, one natural solution for estimating f is then to approximate it by an appropriately chosen partial sum of
this expansion, and the problem of choosing a partial sum becomes one of variable selection in linear regression with the basis functions as the regressors (variables to choose from), and the Fourier coefficients as the regression coefficients. This way, the problem of estimation of f also becomes one of model selection, where each model specifies which basis functions will be used to describe f. Note that this way each model is a false model, being a finite sum of sine and cosine terms, approximating the true infinite sum. A popular choice of models is to take them as nested ones, with model Mk considering the partial sum involving the first k basis functions and Fourier coefficients. AIC can be defined here exactly as in the linear model setup considered earlier, and by choosing a model here one simply wants to select the correct order of the partial sum for the given sample size, with the goal of good estimation of f and hence good predictive performance with the selected model. Chakrabarti and Ghosh [2006a] show that in this nested model scenario, the model selected by AIC achieves this goal, by proving that the estimate (using least squares estimates of the Fourier coefficients under the model chosen by AIC) of the unknown function converges to the truth very fast, at the so-called minimax rate. We will not delve into the technical details of this statement in this paper.

Many authors have studied asymptotic optimality properties of AIC in terms of predictive performance. To sum up, this line of research shows that under some conditions, AIC is able to predict as well as an Oracle asymptotically. The Oracle provides a lower bound for predictive performance which may possibly be attained only asymptotically but cannot be implemented for finite sample size, since the Oracle depends on the unknown value of the parameter. For example, in the nonparametric regression problem, Shibata [1983] defined an Oracle as

Mn* = argmin_{M∈Mn} E||Xβ − Xβ̂M||²,

where the X in the above expression denotes the n × ∞ design matrix involving all the basis functions, β is the true Fourier coefficient vector of the unknown function, β̂M is the least squares estimate of β under model M, and Mn is the model space, which varies with sample size. It is easy to see that Mn* depends on the true β for each n. Shibata [1983] showed that the ratio of the risk of the model selected by AIC and that of Mn* tends to 1 as n → ∞, for each true β.

We are now in a position to briefly indicate Akaike's original rationale or motivation behind the definition of the criterion named after him, without going deep into the technical details. Suppose f is the true unknown density from which i.i.d. sample observations Y1, . . . , Yn are generated. The job of the statistician is to mimic the truth based on sample data. In parametric inference one considers a set of models M and each model consists of densities indexed by a finite number of parameters. The goal is to find a model which is closest to the truth, in the sense that it contains a density which is closest (among all densities included in all the candidate models) to the true density in terms of some appropriate measure of divergence. As a measure of divergence between the true density f and an approximating density g, Akaike considered the Kullback-Leibler divergence, given
by

K(f, g) = ∫ f(x) log [f(x)/g(x)] dx.
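Since K(f, g) is an expectation under f, it can be approximated by simple Monte Carlo, which is often how such divergences are evaluated in practice. The sketch below (with f and g chosen arbitrarily, purely for illustration) draws from f and averages log(f/g):

```python
import numpy as np
from scipy.stats import norm, cauchy

# Monte Carlo estimate of K(f, g) = E_f[ log f(Y) - log g(Y) ]
rng = np.random.default_rng(3)
f, g = norm(loc=0, scale=1), cauchy(loc=0, scale=1)

y = f.rvs(size=200_000, random_state=rng)
kl_estimate = np.mean(f.logpdf(y) - g.logpdf(y))
print(kl_estimate)   # positive; K(f, g) = 0 only when g = f almost everywhere
```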
(Note that for measuring closeness (or lack of it) between two densities, the Hellinger distance and the Kullback-Leibler divergence are standard measures used by statisticians, and they have been used by philosophers also in a slightly different context; see [Joyce, 1999].) For each model Mk in M, one can find the maximum likelihood estimator θ̂k based on Y1, . . . , Yn under that model. Then, taking gk(·|θ̂k) as the representative density from model Mk, Akaike considered minimizing over Mk ∈ M the criterion

(2)   (1/n) E{ ∫ f(Y^new) log [f(Y^new) / g(Y^new|θ̂k)] dY^new },

where the expectation is taken with respect to the true density f and Y^new denotes an independent sample of size n from f. The quantity in (2) measures the divergence per observation between the predicted density and the truth. (Criterion (2) is the most general one, suitable for all situations, e.g., the linear regression setup described above. But if the Yi's are i.i.d., as in our case, the criterion reduces to E{ ∫ f(Y^new) log [f(Y^new) / g(Y^new|θ̂k)] dY^new }, where Y^new is a sample of size one from f.) Minimizing (2) is again equivalent to maximizing

(1/n) E{ ∫ f(Y^new) log g(Y^new|θ̂k) dY^new }

with respect to k. It was shown by Akaike that in large samples, an approximately unbiased estimator of this last quantity is given by

(1/n) (log L(θ̂k) − dim(Mk)),

where dim(Mk) is the number of free estimable parameters in model Mk. In the linear model example, under the assumption of an orthogonal design matrix, a simple calculation shows that an exactly unbiased estimator of (1/n) E{ ∫ f(Y^new) log g(Y^new|θ̂k) dY^new } is given by (1/n)(log L(θ̂k) − k).

Now we will briefly mention the connections of AIC with some other model selection criteria. Although the Akaike Information Criterion (AIC) is considered mainly a non-Bayesian, i.e., classical statistical, criterion, there have been some studies which show that in certain normal linear model problems, as in Example 4, it has an Empirical Bayes interpretation (where at least some part of the prior is estimated from the data; see, e.g., [Ghosh et al., 2006, chapter 9]), in the sense that under some conditions, the Empirical Bayes model selection rule and AIC choose the same model either for each sample size or asymptotically. In all these problems a model is chosen so as to minimize the expected posterior loss (using least squares estimates) in prediction of a new replicate of the dependent variable at a fixed
predictor value. See [Mukhopadhyay and Ghosh, 2003; Chakrabarti and Ghosh, 2007] for further details on this. It is also worth mentioning here that AIC can also be related to a relatively recent model selection criterion, the DIC of [Spiegelhalter et al., 2002]. Spiegelhalter et al. [2002] define what they call a Bayesian measure of model complexity, or the effective number of parameters in a model, using information-theoretic considerations. DIC is then defined as a penalized version of a Bayesian measure of fit, the penalty being the model complexity. This is similar in spirit to the usual penalized likelihood model selection criteria, where the measure of fit is often (minus) twice the maximized log-likelihood. As observed in [Spiegelhalter et al., 2002; Chakrabarti and Ghosh, 2006b], under the assumption of posterior normality of the parameters in the model, DIC coincides with AIC asymptotically.

Last but not least, we would like to point out the connection of AIC with cross-validation. The cross-validatory way of model selection, as the name suggests, keeps a part of the sample for the estimation of the parameters in candidate models and uses the remaining part of the data for validation and hence choice of the appropriate model, the idea being not to use the same data twice for two different purposes. In its simplest form, namely leave-1-out cross-validation, one just keeps one observation away at a time and calculates, for each candidate model, the sum of squared error prediction losses in predicting each observation based on all the remaining observations, estimating the parameters in the model using them. One chooses the model which minimizes this cross-validatory criterion. Using certain regularity conditions, Stone [1977] argued that this form of cross-validation and AIC are equivalent, in the sense that the difference between the criteria values for a given model becomes negligible in large samples. But as observed in [Chakrabarti and Ghosh, 2007], this fact seems to have been "overinterpreted" by people. It turns out that if the model under consideration is not the true model (which, by the way, is one of the crucial assumptions in [Stone, 1977]), then the widely believed equivalence fails to hold. Some concrete examples of this phenomenon and theoretical explanations are given in [Chakrabarti and Ghosh, 2007]. Finally, for the non-nested (all-subsets) model selection scenario, a modification of AIC with an additional penalty seems to help choose a more parsimonious model, as studied in [Chakrabarti and Ghosh, 2007]. The modification of AIC is
AIC(M) = log L(θ̂M) − pM + pM log(w/(1 − w)),

where 0 < w < 1. The use of such a modification can be partially motivated by seeing the extra term as a penalty for the complexity of the model space: the more complex the model space, the more comparisons there are to be made, and the greater the chance of choosing a larger-dimensional model when in fact a smaller model is also true.
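The criterion is easy to compute. The following minimal sketch (our own illustration, not code from the paper) evaluates the modified criterion over a nest of Gaussian linear models with known error variance; the simulated design, the choice w = 0.2, and all function names are hypothetical.

```python
import numpy as np

def gaussian_log_lik(y, yhat, sigma2=1.0):
    """Maximised Gaussian log-likelihood, assuming the variance sigma2 is known."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - yhat) ** 2) / sigma2

def modified_aic(y, X, w=0.2):
    """AIC(M) = log L(theta_hat_M) - p_M + p_M * log(w / (1 - w)); maximise over models."""
    p = X.shape[1]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return gaussian_log_lik(y, X @ beta_hat) - p + p * np.log(w / (1 - w))

rng = np.random.default_rng(1)
n, kmax = 100, 6
X_full = rng.normal(size=(n, kmax))
y = X_full[:, :2] @ np.array([1.0, -0.5]) + rng.normal(size=n)  # true model uses 2 regressors

scores = {k: modified_aic(y, X_full[:, :k]) for k in range(1, kmax + 1)}
print("selected dimension:", max(scores, key=scores.get))
```

Since log(w/(1 − w)) < 0 whenever w < 1/2, such a choice penalizes each parameter more heavily than ordinary AIC does, which is the source of the extra parsimony.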
3 BAYES FACTOR AND BIC
Suppose X1, ..., Xn are independent and identically distributed (i.i.d.) random variables and the models M1, M2, ..., Mk specify k parametric families of densities. The model Mj specifies the density as p(x|θj), where θj is a parameter in Θj, the parameter space corresponding to model Mj. The Bayesian analyst has to provide a prior πj(θj|Mj) for θj, conditional on the assumption that Mj is true. For notational simplicity, we will henceforth drop the subscript j and denote this prior as π(θj|Mj). Then the probability density of the data under Mj is defined, letting x = {x1, ..., xn}, as

mj(x) = p(x|Mj) = ∫_{Θj} ∏_{i=1}^n p(xi|θj) π(θj|Mj) dθj.
It is known that the true model belongs to this class of k models, but it is not known which one is true. The Bayesian would assign prior probability πj for Mj to be true, where Σ_{j=1}^k πj = 1. We consider 0-1 loss, i.e., the loss is 0 if a true model is chosen and 1 if a false model is chosen. The Bayes rule is to choose the model Mj for which the posterior probability

P(Mj|x) = mj(x) πj / Σ_{j′=1}^k mj′(x) πj′
is the largest for the given x. If the k models are assumed to be equally likely, i.e., πj = 1/k for j = 1, ..., k, the Bayes rule reduces to choosing the Mj that maximizes mj(x) for the given x.

Consider now k nested models M1, M2, ..., Mk specifying θ ∈ Θj, where Θ1 ⊂ Θ2 ⊂ ··· ⊂ Θk. We assume Mk is a true model, but we do not wish to choose it if a more parsimonious model is also true. It is assumed that under the most complex model there is a given density p(x|θ), θ ∈ Θk. This leads to a similar assumption for any Mj (j < k), namely, that under Mj there is a density p(x|θ), θ ∈ Θj.

We now discuss the usual assignment of probabilities to these models Mj. We distinguish logically between a model Mj and its associated Θj. Though the Θj's are nested in the usual set-theoretic sense, we do not take the models as logically equivalent to their associated parameter sets. Rather, models are hypotheses about values of θ; often they are scientific hypotheses about natural phenomena. We consider a historical example. We imagine we are contemporaries of Galileo and are speculating on

M1: θ = 0  and  M2: θ is an arbitrary real number,

where θ is the true difference in falling times of two objects in a vacuum. Notice that M1 is Galileo's scientific hypothesis, backed by his knowledge and intuition,
and π(θ|M1) is Galileo's conditional probability for θ, given by π(θ = 0|M1) = 1. On the other hand, the conditional probability that θ takes any particular value under π(θ|M2) is an assignment based on no particular knowledge, and so is small for all θ. The Bayesian may adopt these conditional probabilities as his own, given that he tries to put himself in the same frame of mind as Galileo and others, taking into account the background of each model. Alternatively, for each j, he may pretend that he believes in Mj and in that state of mind assign a conditional subjective probability distribution on Θj. This is admittedly difficult, if not impossible, but all we are trying to say is that π(θ|M1) and π(θ|M2) are logically unrelated, and that it makes sense to take, as Bayesians usually do, π(θ = 0|M1) = 1 and π(θ|M2) to be some suitable low-information prior, as suggested by Jeffreys. We wanted to make two points through this illustrative example. The first is that the π(θ|Mj) for different Mj's are not logically determined by the assignment for the most complex model. The second is that we need the assignment π(θ|Mj) to complete our Bayesian set-up for selecting one of a set of nested models.

A Bayesian would also assign prior probability πj to Mj, where Σ_{j=1}^k πj = 1 and, usually, because of parsimony, π1 ≥ π2 ≥ ··· ≥ πk. The common (so-called objective) choice is πj = 1/k for all j ∈ {1, ..., k}. In our historical example, this would be the Bayesian's choice if he didn't want to favour either Galileo or the general public. We emphasize again that the Bayesian does not logically identify the model with the associated parameter space, and thus can still assign equal probability to two nested models, indicating his degree of belief about the truth of the two competing models. The conditional density of θ given Mj usually assigns much more probability to Θj ∩ Θ_{j−1}^c than to Θ_{j−1}. In fact, in most actual problems the conditional probability of Θ_{j−1} is zero, because Θ_{j−1} usually has lower dimension than Θj. Combining all the components, we can calculate the conditional or posterior probability of Mj given x as

P(Mj|x) = πj ∫_{Θj} p(x|θ) π(θ|Mj) dθ / Σ_{j′=1}^k πj′ ∫_{Θj′} p(x|θ) π(θ|Mj′) dθ.

The Bayes rule is to choose the model with maximum posterior probability. In the usual case where πj = 1/k, j = 1, ..., k, this is the same as choosing a model that maximizes

∫_{Θj} p(x|θ) π(θ|Mj) dθ,

which is usually called the marginal density of x, obtained by integrating out θ. Note that since the π(θ|Mj)'s are different, in the sense that one is not logically
derived from the other, there is no reason to expect that the method will always choose the most complex model. Under regularity conditions (see, e.g., [Schwarz, 1978; Ghosh et al., 2006, Chapter 4]), the logarithm of the above marginal likelihood is BIC + O(1), where

BIC (Bayes Information Criterion) = log L(θ̂j) − (pj/2) log n,

pj being the dimension of Θj, θ̂j the maximum likelihood estimator of θ under model Mj, and L(θ̂j) the joint density of x1, ..., xn under model Mj for θ = θ̂j.

One often assigns probability in a different way in the non-nested case. Let π(θ) be a probability density over Θ = Θ1 ∪ ... ∪ Θk and p(x|θ) the density of x under θ, θ ∈ Θ. In this case the Mj's do correspond to just the subsets Θj specified by them, and the πj's are determined automatically from this identification as πj = ∫_{Θj} π(θ) dθ. Simple algebra shows that the posterior probability P(Mj|x) = P(θ ∈ Θj|x) is given by

P(Mj|x) = ∫_{Θj} p(x|θ) π(θ) dθ / ∫_Θ p(x|θ) π(θ) dθ.

In this assignment of probabilities, P(Mj′|x) ≤ P(Mj|x) whenever Θj′ ⊂ Θj. For nested models with Θj's of different dimensions, any assignment of probabilities of this kind to the lower dimensional sets would lead to zero prior and posterior probability. More importantly, this assignment ignores the possibility of the density of θ depending on the model. In the nested case, as when θ is the effect of a drug, the probability distributions of θ under M1: θ = 0 and M2: θ ∈ R are often expected to be very different. We present below a concrete example to illustrate some of the points in this discussion.

EXAMPLE 6. Let θ be the effect of a drug on, say, the (systolic) blood pressure. It is common to test M1: θ = 0 (no effect) against M2: θ ≤ 0 (most likely some good effect). This is a question of selecting one of two nested models. A typical choice of priors is
π(θ = 0|M1) = 1,
π(θ|M2) = (2/(τ√(2π))) e^{−θ²/(2τ²)} if θ ≤ 0, and 0 otherwise,

and the density of a single observation x is defined as

p(x|θ) = N(θ, σ²), with θ = 0 under M1,
p(x|θ) = N(θ, σ²), with θ ≤ 0 under M2.
For simplicity we assume σ² is known and equal to one, while τ² = 2 (say). Let p1 = p2 = 1/2. Given data x = (x1, ..., xn) on n persons, one has

P(M1|x) = (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n xi²} / [ (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n xi²} + ∫_{θ≤0} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (2/(√2√(2π))) e^{−θ²/4} dθ ].

Similarly, P(M2|x) is given by

P(M2|x) = ∫_{θ≤0} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (2/(√2√(2π))) e^{−θ²/4} dθ / [ (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n xi²} + ∫_{θ≤0} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (2/(√2√(2π))) e^{−θ²/4} dθ ].

If, on the other hand, we wish to test a new model M1: θ ≤ 0 (good or no effect) against M2: θ ≥ 0 (bad or no effect) (i.e., a non-nested model choice scenario), with θ ∼ N(0, τ² = 2) (untruncated normal), one has

P(M1|x) = ∫_{θ≤0} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (1/(√2√(2π))) e^{−θ²/4} dθ / ∫_{θ∈R} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (1/(√2√(2π))) e^{−θ²/4} dθ,

and

P(M2|x) = ∫_{θ≥0} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (1/(√2√(2π))) e^{−θ²/4} dθ / ∫_{θ∈R} (1/√(2π))^n e^{−(1/2) Σ_{i=1}^n (xi−θ)²} (1/(√2√(2π))) e^{−θ²/4} dθ.
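For readers who want to see these quantities computed, the following minimal sketch (our own, with invented data; n = 20, σ² = 1, τ² = 2 as in the text) evaluates the posterior probabilities by one-dimensional numerical integration, after factoring the common term (1/√(2π))^n e^{−(1/2)Σxi²} out of numerator and denominator.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(2)
n, tau = 20, np.sqrt(2.0)
x = rng.normal(-0.3, 1.0, size=n)   # invented data, sigma^2 = 1
xbar = x.mean()

# Likelihood ratio relative to theta = 0: prod phi(x_i - t) / prod phi(x_i)
lik_ratio = lambda t: np.exp(n * xbar * t - 0.5 * n * t ** 2)

# Nested case: point mass at 0 under M1; N(0, tau^2) truncated to theta <= 0 under M2.
# trunc_prior is the truncated density on theta <= 0 (we only integrate over that region).
trunc_prior = lambda t: 2.0 * norm.pdf(t, scale=tau)
J, _ = quad(lambda t: lik_ratio(t) * trunc_prior(t), -np.inf, 0.0)
print("nested:     P(M1|x) =", 1.0 / (1.0 + J), " P(M2|x) =", J / (1.0 + J))

# Non-nested case: untruncated N(0, tau^2) prior; M1: theta <= 0 vs M2: theta >= 0
dens = lambda t: lik_ratio(t) * norm.pdf(t, scale=tau)
neg, _ = quad(dens, -np.inf, 0.0)
tot, _ = quad(dens, -np.inf, np.inf)
print("non-nested: P(M1|x) =", neg / tot, " P(M2|x) =", 1.0 - neg / tot)
```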
The BIC model selection rule, as defined above, chooses the Mj that maximizes BIC. It can be shown rigorously that under suitable regularity conditions, as n → ∞ with k and the pj's held fixed, the BIC rule is asymptotically the same as the Bayes rule. Both the Bayes rule and BIC are consistent in the sense that they choose the true model with probability tending to one. In the nested case, the Bayes rule and BIC choose the true model with the smallest dimension. Note that in the nested case, if Mj is true then M_{j+1}, ..., Mk are also true, in the sense that if θ ∈ Θj then it also belongs to Θ_{j+1}, and so on. Thus BIC is consistent with the generally accepted principle that the model chosen should be as simple as possible when more than one model is true, or when several models explain the data equally well. It is easy to show with examples that AIC need not do so. In fact, with k = 2 and 0-1 loss, model selection is equivalent to testing H0: θ ∈ Θ1 vs H1: θ ∈ Θ2 (or θ ∈ Θ2 ∩ Θ1^c). In this context AIC has a type 1 error probability α which is asymptotically greater than 0.05 in many simple examples. This makes it a very non-conservative test, and is one reason why Bayesians don't seem to like AIC. However, AIC has some good properties, like predictive optimality, not possessed by BIC, vide Sections 1 and 4. In short, no model selection rule will serve all purposes.

Before we end, we quote, in the context of nested models, from [Forster and Sober, 1994, paragraph 3 of section 7]. In this remark "LIN" refers to the class of all straight lines and "PAR" refers to the class of all parabolic curves in R².

"The key element of any Bayesian approach is the use of Bayes' Theorem, which says that the probability of any hypothesis H given any data is proportional to its prior probability times its likelihood: p(H/Data) ∝ p(H) × p(Data/H). However, it is an unalterable fact about probabilities that (PAR) is more probable than (LIN), relative to any data you care to describe. No matter what the likelihoods are, there is no assignment of priors consistent with probability theory that can alter the fact that p(PAR/Data) ≥ p(LIN/Data). The reason is that (LIN) is a special case of (PAR). How, then, can Bayesians explain the fact that scientists sometimes prefer (LIN) over (PAR)?"

Rather than trying to respond to the criticism in [Forster and Sober, 1994], we have tried to explain in this section the Bayesian practice and point of view as well as we can. Hopefully, this will help to some extent.

4 COMPARISON OF AIC AND BIC THROUGH AN EXAMPLE
In this section we consider AIC and BIC from a comparative point of view. Much research has been done on these two criteria. We try to summarize here, with minimal technicality, what is known about where each of the two criteria is suitable (or otherwise), with the aid of an illustrative example. It is quite clear by now that AIC and BIC are appropriate for different purposes and have different motivations. BIC came out as a large-sample approximation to the Bayes rule under a 0-1 loss function when the dimensions of the models under study remain fixed. On the other hand, AIC was derived and proposed by Akaike as a rule which, in large samples, does well in predicting an independent future set of observations based on the data at hand, the prediction accuracy being achieved by minimizing the expected Kullback-Leibler divergence between the true model and the candidate models. Note that the Bayes rule for 0-1 loss corresponds to choosing the model having the largest a posteriori probability, while minimizing the expected Kullback-Leibler divergence using AIC is, for normal linear models, equivalent to minimizing the expected squared error prediction loss (in predicting an independent future set of observations). It can be said, as Ghosh [2006] points out, that the penalty of BIC is appropriate for, and arose from, the use of the 0-1 loss, while that of AIC corresponds to the squared error loss. It indeed turns out, as the above discussion would intuitively suggest, that AIC is a good rule if prediction is the sole purpose, and BIC is good if the scientist wants to select the correct model. AIC excels in those problems where all models in the model space
are incorrect, or where there is at most one correct model in the model space, while BIC excels if the correct model is in the model space. If more than one model is true, AIC may not choose the simplest true model.

We will now expand on the thoughts of the last paragraph with a simple example. Consider the model yi = µ + εi, i = 1, ..., n, where the εi are independent and identically distributed N(0,1) errors. We have two models, M1: µ = 0 vs M2: µ ∈ R. Once a model Mα (α = 1 or 2) is selected, µ is to be estimated by the maximum likelihood estimator under that model, namely µ̂(α). Suppose one wants to minimize with respect to α the quantity Eµ( (1/n) Σ_{i=1}^n (zi − ẑi(α))² ), where zi, i = 1, ..., n, is a future set of n independent observations from the same distribution, and ẑi(α) = µ̂(α) is the estimated value of zi based on the y's and model Mα, α = 1, 2. It can be shown that minimizing this is the same as minimizing with respect to α the quantity Eµ(µ − µ̂(α))², or (µ − µ̂(α))² for each given y1, ..., yn. We will work with the quantity A(α, µ) = Eµ(µ − µ̂(α))². Note that if the value of µ were known, one would know which of M1 and M2 minimizes A(α, µ) with respect to α, by simply evaluating A(1, µ) and A(2, µ). Minimizing A(α, µ) with respect to α in this way yields the optimal predictive rule, called the Oracle:

Mora(µ) = M1 if 0 ≤ µ² ≤ 1/n, and M2 if µ² > 1/n.

In this way the Oracle prescribes, for each value of µ, the model which will be good for prediction at that µ. But this rule depends on the unknown (true) µ and hence can't be used in practice. The expected loss for the Oracle is µ² if 0 ≤ µ² ≤ 1/n, and 1/n if µ² > 1/n. The statistician wants to find a model selection rule which does almost as well as the Oracle for a moderate amount of data, and as well as the Oracle asymptotically, for an infinite amount of data. Upon simple calculation, it turns out that AIC chooses M2 if nȳ² > 2 and M1 otherwise, while BIC chooses M2 or M1 according to whether or not nȳ² exceeds log n. For AIC, therefore, the loss is (ȳ − µ)² if M2 is chosen (i.e., if nȳ² > 2), since µ̂(2) = ȳ, while it is µ² if M1 is chosen, since µ̂(1) = 0. The loss can be expressed simply as (ȳ − µ)² I_{nȳ²>2} + µ² I_{nȳ²≤2}. Taking expectations, the expected loss for AIC is

Eµ{(ȳ − µ)² I_{nȳ²>2}} + µ² Pµ(nȳ² ≤ 2).

A similar argument gives that for BIC the expected loss is

Eµ{(ȳ − µ)² I_{nȳ²>log n}} + µ² Pµ(nȳ² ≤ log n).

Consider first the case when µ = 0, i.e., M1 is true and hence M2 is also true. Then Mora(0) selects M1, AIC chooses M1 if nȳ² ≤ 2, and BIC chooses M1 if
nȳ² ≤ log n. Noting that under µ = 0 we have nȳ² ∼ χ²_(1), it is clear that AIC chooses model M2 with a fixed non-zero probability (the probability that a chi-square random variable with 1 degree of freedom exceeds 2) for all n, while BIC chooses M1 with probability tending to 1 as n → ∞ (since the probability that a chi-square random variable with one degree of freedom is less than log n tends to 1 as n → ∞). This reinforces a general fact: if two fixed-dimensional nested models are both true, AIC often fails to choose the smaller model, while BIC excels in choosing the parsimonious true model by penalizing the larger model heavily. For all n, the expected loss of the Oracle rule is 0, while that of AIC is c/n, where 0 < c < 1 is a constant. The expected loss of BIC is always of a strictly larger order than 1/n^{3/2} and simultaneously of a strictly smaller order than 1/(n(log n)^δ) for any fixed positive δ, for all large enough n. While talking about orders, we follow the convention that for two real sequences a_n and b_n, a_n is of a strictly larger order than b_n (and hence b_n of a strictly smaller order than a_n) if a_n/b_n → ∞ as n → ∞. Thus, in this situation the predictive performance of BIC is slightly better than that of AIC.
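These limiting statements are easy to check numerically; the sketch below (ours, not from the text) estimates by Monte Carlo, for µ = 0, the probability that each criterion selects M2, exploiting the fact that nȳ² ∼ χ²_(1).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
reps = 200_000
print("exact P(chi2_1 > 2) =", chi2.sf(2, df=1))        # AIC's limiting M2-probability
for n in (50, 500, 5000, 50_000):
    # ybar ~ N(0, 1/n) under mu = 0, so n * ybar**2 ~ chi2 with 1 df
    stat = n * rng.normal(0.0, 1.0 / np.sqrt(n), size=reps) ** 2
    print(n, "AIC picks M2:", (stat > 2).mean(),
          " BIC picks M2:", (stat > np.log(n)).mean())   # BIC's probability -> 0
```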
Now consider the case when µ ≠ 0, i.e., only M2 is true. Here, for all large enough n, the Oracle chooses M2 and, with probability tending to 1, both AIC and BIC choose M2. It is easy to show that the ratios of the expected Oracle loss to those of BIC and AIC tend to 1 as n → ∞. But, in the absence of any knowledge of the true model, and noting that the sample size is never infinite, a conservative approach to choosing between AIC and BIC would be to see which rule minimizes, for any fixed n, the worst possible expected loss, i.e., which α ∈ {α_BIC, α_AIC} minimizes sup_{µ∈R} A(α, µ), where α_BIC and α_AIC denote, respectively, the model selected by BIC and AIC for the given sample. For the Oracle, this quantity is 1/n. A simple calculation shows that for AIC this quantity is, for all large n, of the form K1/n for some constant 1 < K1 < 2, while for BIC it is of the form K2 log n / n, where K2 > 0 is finite. Hence for BIC the worst possible rate is a factor of order log n higher than that of AIC and the Oracle. This pathology is caused by the fact that for a given n, if µ is of the order of √(log n)/√n, then BIC still chooses M1 with non-vanishing probability, making the second term in the expression for the expected loss large. BIC chooses a lower dimensional model unless the increase in log-likelihood from adding a new parameter is at least (log n)/2. So, for a fixed large n and a small µ ∼ √(log n)/√n, the expected loss of BIC at that µ will be larger than that of AIC by a factor of order log n.
To sum up, this example shows that the overall predictive performance of AIC is better than that of BIC in this problem, while BIC does a better job of selecting the correct model, notwithstanding its tendency, due to its large penalty, to choose the smaller-dimensional model.
5 RECENT ADVANCES IN MODEL SELECTION
Although AIC and BIC are probably the most popular model selection criteria, each with its specific utility (as described in detail above), they are not the solution to all types of model selection problems. In recent years many other penalized likelihood model selection criteria have been proposed. For example, in cubic spline model fitting, the complexity of the model is taken care of by penalizing the lack of smoothness of the fitted curve. In the LASSO (which penalizes the least squares criterion, or the log-likelihood criterion for normal linear models, by the sum of the absolute values of the regression coefficients), one wants to select an optimal model in the presence of sparsity (i.e., when most regression coefficients are zero or close to zero). This is particularly useful in high dimensional problems. Fan and Li [2001] propose three desirable properties of a penalty function and choose a non-concave penalty which possesses optimal properties; however, non-concavity makes the rule computationally inefficient. One can also see [Abramovich et al., 2006] for another approach to handling sparse sequences, which connects simultaneous testing of hypotheses with optimal rates of convergence. They conjectured in their paper that a modified BIC with per-parameter penalty of the form (1/2) log(n/p), instead of (1/2) log n, would result in a good model selection rule providing optimal estimators in sparse sequences. Here p is the number of non-zero parameters in the model. It is worth pointing out here that such a change in BIC will somewhat reduce the difference between the AIC and BIC criteria. This is so because in many real-life problems the number of parameters p increases with the sample size n (since complex data are typically modelled with more complex models, i.e., those with higher dimension), and if both n and p are large, the difference between the (per-parameter) penalties of AIC and the (modified) BIC is expected to be smaller. (It is worth mentioning that the same penalty has been used earlier in [Pauler, 1998; Berger et al., 2003], with the argument that the effective sample size per parameter is n/p if the model under consideration has p free parameters.)

Much of the above theory for model selection is tied up with the use of least squares estimates. If one uses Stein-type shrinkage estimates, then the notion of complexity also changes a lot. This aspect has not been studied in the literature. In the field of machine learning, particularly in classification problems, people have considered penalized empirical risk minimization, and have used deep results from the theory of empirical processes to derive a penalty function which is, roughly, an upper bound on the error of estimation, defined to be the error in estimating the best approximating model (i.e., the member of the model space closest to the true model). Vapnik [1998, Chapter 6] recommends minimizing the structural risk, which is essentially a penalized risk, with the penalty arising in a somewhat different way. Vapnik [1998, Chapter 6] also provides a good introduction to model selection based on Kolmogorov complexity, or some approximation to it as in [Rissanen, 1978]. Other recent contributors to these aspects in machine learning are van de Geer, Koltchinskii, Bartlett, Lugosi, among others. The interested reader
may search the rich literature on this. A good starting point is the collection of abstracts of talks presented at a conference on model selection at Oberwolfach in 2005, available on the internet at http://www.ems-ph.org/journals/abstract/owr/2005-02-04/2005-02-04-05.pdf. It seems that some form of unification of the theory of model selection may be possible in the future, integrating all these apparently disconnected facts through some more fundamental considerations.

This paper has been devoted mainly to discussing several model selection criteria, with most of the emphasis given to AIC and BIC. Typically, once a model is selected, one uses that model to do further inference on the data at hand, or on future data obtained from similar experiments. Instead, one may use a Bayesian model average for estimation or prediction, combining the Bayes estimates under the different models with weights proportional to the marginal likelihoods of the models. Marginal likelihood is defined in Section 3. See [Raftery et al., 1997; Hoeting et al., 1999] for more details. Similar ideas in the classical statistical context have been proposed by Hjort and Claeskens [2003], Breiman [2001], and others. These ideas have become very popular among both Bayesians and non-Bayesians.

SUMMING UP

In the context of model selection, we discussed AIC and BIC (briefly mentioning some of their proposed modifications) along with the original motivations behind them. We studied concrete applications and theoretical results which reinforce the big-picture summary captured in the text: BIC is more useful for selecting a correct model, while AIC is more appropriate for finding the best model for predicting future observations. Each of these facts holds under suitable conditions mentioned in the text. We also discussed some recent advances in model selection, some of which are of interest to researchers in diverse fields of the scientific spectrum.

ACKNOWLEDGEMENT

The authors would like to thank Prasanta Bandyopadhyay and an anonymous referee for valuable suggestions leading to substantial improvement of the paper. In the process, it has clarified our own understanding of the subject, even though we have not been able to obtain a version of Bayesian model selection that will be wholly satisfactory to philosophers. The authors would also like to thank M. Prem Laxman Das and Sumanta Sarkar for help with LaTeX in drawing Figure 1.

BIBLIOGRAPHY

[Abramovich et al., 2006] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate, The Annals of Statistics
34: 584-653, 2006.
[Akaike, 1973] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, in B. N. Petrov and F. Csáki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267-281, 1973.
[Akaike, 1974] H. Akaike. A new look at the statistical model identification, IEEE Transactions on Automatic Control AC-19: 716-723, 1974.
[Berger and Pericchi, 1995] J. O. Berger and L. R. Pericchi. The Intrinsic Bayes Factor for Linear Models, in J. M. Bernardo et al. (eds.), Bayesian Statistics 5. London: Oxford University Press, 23-42, 1995.
[Berger et al., 2003] J. O. Berger, J. K. Ghosh, and N. D. Mukhopadhyay. Approximations and consistency of Bayes factors as model dimension grows, Journal of Statistical Planning and Inference 112: 241-258, 2003.
[Bernardo and Smith, 1994] J. M. Bernardo and A. F. M. Smith. Bayesian Theory, Wiley: Chichester, 1994.
[Breiman, 2001] L. Breiman. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author), Statistical Science 16: 199-231, 2001.
[Burnham and Anderson, 2003] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer: New York, 2003.
[Chakrabarti and Ghosh, 2006a] A. Chakrabarti and J. K. Ghosh. Optimality of AIC in Inference about Brownian Motion, Annals of the Institute of Statistical Mathematics 58: 1-20, 2006.
[Chakrabarti and Ghosh, 2006b] A. Chakrabarti and J. K. Ghosh. A generalization of BIC for the general exponential family, Journal of Statistical Planning and Inference 136: 2847-2872, 2006.
[Chakrabarti and Ghosh, 2007] A. Chakrabarti and J. K. Ghosh. Some aspects of Bayesian model selection for prediction (with discussion), in J. M. Bernardo et al. (eds.), Bayesian Statistics 8. Oxford University Press, 51-90, 2007.
[Fan and Li, 2001] J. Fan and R. Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, Journal of the American Statistical Association 96: 1348-1360, 2001.
[Forster and Sober, 1994] M. Forster and E. Sober. How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions, The British Journal for the Philosophy of Science 45: 1-35, 1994.
[Gardner, 1997] M. Gardner. Relativity Simply Explained, Dover: New York, 1997.
[Ghosh, 2006] J. K. Ghosh. Different Role of Penalties in Penalized Likelihood Model Selection Rules, abstract of a talk presented at a workshop on Multivariate Statistical Methods at the Indian Statistical Institute, 2006.
[Ghosh and Samanta, 2001] J. K. Ghosh and T. Samanta. Model selection - An overview, Current Science 80: 1135-1144, 2001.
[Ghosh et al., 2006] J. K. Ghosh, M. Delampady, and T. Samanta. An Introduction to Bayesian Analysis: Theory and Methods, Springer: New York, 2006.
[Hastie et al., 2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, New York: Springer, 2001.
[Hjort and Claeskens, 2003] N. L. Hjort and G. Claeskens. Frequentist Model Average Estimators, Journal of the American Statistical Association 98: 879-899, 2003.
[Hoeting et al., 1999] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky. Bayesian Model Averaging, Statistical Science 14: 382-401, 1999.
[Joyce, 1999] J. Joyce. The Foundations of Causal Decision Theory, Cambridge: Cambridge University Press, 1999.
[Lehmann and Casella, 2001] E. L. Lehmann and G. Casella. Theory of Point Estimation, Springer-Verlag: New York, 2001.
[Li, 1987] K. C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: Discrete index set, The Annals of Statistics 15: 958-975, 1987.
[Mukhopadhyay, 2000] N. D. Mukhopadhyay. Bayesian Model Selection for High Dimensional Models with Prediction Error Loss and 0-1 Loss, Ph.D. Thesis, Purdue University, 2000.
[Mukhopadhyay and Ghosh, 2003] N. D. Mukhopadhyay and J. K. Ghosh. Parametric Empirical Bayes Model Selection - Some Theory, Methods and Simulations, in K. B. Athreya et al. (eds.), IMS Lecture Notes in honor of Rabi Bhattacharya, 2003.
[Pauler, 1998] D. K. Pauler. The Schwarz criterion and related methods for normal linear models, Biometrika 85: 13-27, 1998.
[Raftery et al., 1997] A. Raftery, D. Madigan, and J. A. Hoeting. Bayesian Model Averaging for Linear Regression Models, Journal of the American Statistical Association 92: 179-191, 1997.
[Rissanen, 1978] J. Rissanen. Modeling by shortest data description, Automatica 14: 465-471, 1978.
[Schwarz, 1978] G. Schwarz. Estimating the dimension of a model, The Annals of Statistics 6: 461-464, 1978.
[Shao, 1997] J. Shao. An asymptotic theory for linear model selection, Statistica Sinica 7: 221-264, 1997.
[Shibata, 1981] R. Shibata. An optimal selection of regression variables, Biometrika 68: 45-54, 1981.
[Shibata, 1983] R. Shibata. Asymptotic mean efficiency of a selection of regression variables, Annals of the Institute of Statistical Mathematics 35: 415-423, 1983.
[Spiegelhalter et al., 2002] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde. Bayesian measures of model complexity and fit (with discussion), Journal of the Royal Statistical Society, Series B, Methodological 64: 583-649, 2002.
[Stone, 1977] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, Journal of the Royal Statistical Society, Series B, Methodological 39: 44-47, 1977.
[Stone, 1979] M. Stone. Comments on model selection criteria of Akaike and Schwarz, Journal of the Royal Statistical Society, Series B, Methodological 41: 276-278, 1979.
[Vapnik, 1998] V. Vapnik. Statistical Learning Theory, Wiley: New York, 1998.
[Wasserman, 2006] L. Wasserman. All of Nonparametric Statistics, Springer, 2006.
POSTERIOR MODEL PROBABILITIES

A. Philip Dawid

INTRODUCTION

We consider the problem of Bayesian inference about the statistical model from which the data arose. By examining the asymptotic dependence of posterior model probabilities on the prior specifications and the data, we refute the conventional wisdom that such problems of model choice exhibit more sensitivity to the prior than is the case for standard parametric inference. Where improper priors are used, the definition of Bayes factor has been regarded as problematic. We argue that many of the supposed difficulties can be avoided by specifying a single overall prior as a weighted sum of measures on the various model spaces (the whole being defined up to a single arbitrary scale factor), and focusing attention directly on the typically proper posterior distribution across models that this implies. We discuss how to select the weights, both using subjective inputs and using formal rules. For the latter case, we demonstrate the importance of using the Jeffreys prior within each model — a choice that goes a long way towards resolving many of the perceived problems connected with arbitrary scaling constants. We illustrate the general theory by constructing ‘reference posterior probabilities’ for normal regression models, and by analysis of an ESP experiment.

1 MARGINAL LIKELIHOOD
Suppose we are uncertain about our statistical model. Conditional on the validity of any candidate M, we have a specified statistical model for the observable X, depending on a parameter θ, having density p(x | M, θ). The specification, meaning, and dimensionality of θ will typically be different for different models. The prior specification is generally given in two parts:

1. A prior distribution over the various models, model M having prior probability p(M); and

2. Within each model M, a prior density p(θ | M) for its parameter θ, conditional on its validity.

We shall assume, without explicit comment, appropriate regularity conditions on the model densities p(x | M, θ) and the prior densities p(θ | M). In particular we assume that p(θ | M) > 0 everywhere. On observing data x, the posterior probability of model M is given by
(1) p(M | x) ∝ p(M) × L(M),

where L(M) is the marginal likelihood of model M for data x:

(2) L(M) ∝ ∫ p(x | M, θ) p(θ | M) dθ
(3)      = p(x | M),

the marginal density at x under M. The proportionality signs in (1) and (2) indicate an unspecified multiplier which does not depend on M; in (1), this is such as to ensure that the right-hand side sums to unity over all the different models considered. The effect of using equations (1) and (2) is to separate out the rôles of the two ingredients of the prior specification, with only the within-model priors entering L(M). The Bayes factor for comparing any two models is just the ratio of their marginal likelihoods; then the posterior odds for this comparison will be just the prior odds multiplied by the Bayes factor. However, in general we will be interested in a larger, finite or countable, set of models, and then the more appropriate thing to look at is the complete marginal likelihood function (2), defined, up to an irrelevant proportionality constant, as a function of the model label. A Bayes factor is just the ratio of two terms in this likelihood function. We remark in passing that a compelling minimal criterion for any ‘pseudo-Bayes factor’ is that it be related to some full ‘pseudo-marginal likelihood function’ in just this way. This criterion is not satisfied by many suggestions that have been made, including those of Spiegelhalter and Smith [1982] and most of the ‘intrinsic Bayes factors’ of Berger and Pericchi [1996a; 1996b; 1997].

2 ASYMPTOTICS

2.1 Consistency
An important general property of the posterior distribution (1) is consistency [Dawid, 1992, §6.4]. Suppose that the observables (X1 , X2 , . . .) arrive in sequence. We do not need to assume these observables to be independent and identically distributed. We entertain a finite or countable collection of parametric models {Mj }. Then, under very weak conditions, the prior probability is unity that the posterior distribution given by (1) will converge to a point mass on the true model (or, if there is more than one model containing the true distribution, on the true model of smallest dimension). Under still weaker (virtually vacuous) conditions, using a Bayesian model mixture with weights given by (1) (and with the appropriate posterior distributions within each model) will, with prior probability 1, yield probability forecasts asymptotically indistinguishable from those derived from the true generating distribution. Note that the expression “with prior probability 1” here means, essentially, under any data-generating distribution belonging to any of the models under consideration.
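This consistency can be watched happening in a toy example (our own sketch, not part of the text): take M1: Xi ∼ N(0,1) against M2: Xi ∼ N(θ,1) with a N(0,1) prior on θ, and generate the data from M1, so that both models contain the truth but M1 has smaller dimension. The likelihood ratio depends on the data only through the sample mean, so the posterior probability of M1 (with equal prior model probabilities) has a closed form.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=100_000)   # data from M1, so both models hold
for n in (10, 100, 1000, 100_000):
    ybar = x[:n].mean()
    # Marginally, ybar ~ N(0, 1/n) under M1 and ybar ~ N(0, 1 + 1/n) under M2;
    # all other factors of the two marginal likelihoods cancel in the ratio.
    bf12 = norm.pdf(ybar, 0.0, np.sqrt(1.0 / n)) / norm.pdf(ybar, 0.0, np.sqrt(1.0 + 1.0 / n))
    print(n, "p(M1 | x) =", bf12 / (1.0 + bf12))   # tends to 1: mass piles on the true
                                                   # model of smallest dimension
```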
2.2 Behaviour of marginal likelihoods
We now describe some further refinements of the above consistency property, through a none too rigorous account of the behaviour of the marginal likelihood L(M) as the sample size increases. We have to make some more specialised assumptions. For simplicity, we suppose that the observables (X1, X2, ...) are generated independently and identically from some (unknown) distribution Q. Let us fix on an arbitrary one of the models under consideration, M say, and drop it from the notation. We further suppose that, for each value of the parameter θ, which ranges over a Euclidean space of dimension d, the (Xi) are modelled as independent and identically distributed, with distribution Pθ. We denote the Fisher information for a single observation by I(θ). Let q^n(x^n), p^n(x^n | θ) be the (product) densities for X^n := (X1, ..., Xn) generated by Q and Pθ, and l_n(θ) = log p^n(X^n | θ). Define K(Q, M) := inf_θ K(Q, Pθ), where K(Q, P) denotes the Kullback-Leibler distance between Q and P. We suppose that this infimum is attained, at say θ∗. In particular, if Q ∈ M then K(Q, M) = 0 and Q = Pθ∗. Defining the marginal density p^n(x) as in (3), we can then show the following:

THEOREM 1. Under Q, as n → ∞:

(i) If Q ∉ M,

(4) log p^n(X^n)/q^n(X^n) = −n K(Q, M) + Op(n^{1/2}).

(ii) If Q ∈ M,

(5) log p^n(X^n)/q^n(X^n) = −(d/2) log(n/(2πe)) + log ρ(θ∗) + Un + Op(n^{−1/2}),

where ρ(θ) = p(θ)/{det I(θ)}^{1/2}. The random variable Un does not depend on the prior density p(θ), and asymptotically Un ∼ (1/2)(χ²_d − d). In particular, the asymptotic expectation of Un is 0.

The proof of Theorem 1 is given in Appendix A.

We note that the function ρ(·) is the density of the prior distribution with respect to the Jeffreys invariant measure µM satisfying dµM(θ) = {det I(θ)}^{1/2} dθ. In particular, ρ(·) is unaffected by non-linear reparametrisation. We shall term ρ(·) the invariantised prior density over M.
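Part (ii) can be checked numerically in the conjugate normal case, where the left-hand side of (5) is exact. In the sketch below (ours), Q = N(0,1) and the model is N(θ,1) with prior θ ∼ N(0,1), so d = 1, θ∗ = 0, I(θ) ≡ 1 and ρ(θ∗) = (2π)^{−1/2}; the gap between the exact log-ratio and the deterministic part of (5) then estimates Un, whose Monte Carlo mean should be close to 0.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 10_000, 20_000
ybar = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)   # sufficient statistic under Q

# Exact: log p^n(X^n)/q^n(X^n) = -0.5*log(n+1) + n^2*ybar^2 / (2*(n+1)) for this prior
lhs = -0.5 * np.log(n + 1.0) + n ** 2 * ybar ** 2 / (2.0 * (n + 1.0))
# Deterministic part of (5): -(d/2)*log(n/(2*pi*e)) + log rho(theta*)
rhs = -0.5 * np.log(n / (2 * np.pi * np.e)) - 0.5 * np.log(2 * np.pi)
print("Monte Carlo estimate of E(U_n):", (lhs - rhs).mean())   # should be near 0
```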
2.3 Behaviour of posterior model probabilities
As we vary the model M , differences in the left-hand side of (4) or (5) give the differences in the log marginal likelihood l(M ) = log L(M ) of the models. These are thus approximated, for large n, by differences of expressions of the form of the right-hand sides of these equations, as appropriate. Incorporating the prior model probabilities p(M ), as in (1), we can thus obtain the posterior probabilities. In particular, suppose we are comparing two models, M1 and M2 , of respective dimensions d1 and d2 . Then:
(a) If K(Q, M2) > K(Q, M1) (so that, in particular, model M2 does not hold, whereas M1 might or might not), we have

(6) log [p(M1 | x)/p(M2 | x)] = n{K(Q, M2) − K(Q, M1)} + Op(n^{1/2}),

so that the data will overwhelmingly favour M1, the posterior odds ratio being exponential in the sample size n.

(b) If both M1 and M2 hold, with say q(x) ≡ p(x | M1, θ1∗) ≡ p(x | M2, θ2∗),

(7) log [p(M1 | x)/p(M2 | x)] = (1/2)(d2 − d1) log(n/(2πe)) + log [p(M1)ρ(θ1∗ | M1) / p(M2)ρ(θ2∗ | M2)] + V,

where V = Op(1) with asymptotic expectation 0, and, by (??.6), the dependence of V on the prior specification is Op(n^{−1/2}). In particular, if d1 < d2 then the data will again overwhelmingly favour M1 (although the posterior odds will now only grow as a power of n). If d1 = d2, the data will not be able to make a definitive choice between the two models. However, it is usual to entertain a set of models which is closed under intersection. In this case, M3 = M1 ∩ M2 will be a true model of typically smaller dimension than either M1 or M2, and it will then be M3 that attracts the overwhelming log marginal likelihood.
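The two growth rates in (6) and (7) can be seen in the same toy setup used earlier (our sketch; closed-form marginals via the sufficient statistic): when the data come from N(0.5, 1), so that M1: N(0,1) is false, the log posterior odds against M1 fall linearly in n; when the data come from N(0,1), so that both models hold, the log odds in favour of M1 grow like (1/2) log n.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

def log_odds_m1(x, n):
    # M1: N(0,1) vs M2: N(theta,1) with theta ~ N(0,1) prior; equal model probabilities
    ybar = x[:n].mean()
    return norm.logpdf(ybar, 0.0, np.sqrt(1.0 / n)) - norm.logpdf(ybar, 0.0, np.sqrt(1.0 + 1.0 / n))

x_true = rng.normal(0.0, 1.0, size=100_000)    # both models hold: case (b)
x_false = rng.normal(0.5, 1.0, size=100_000)   # M1 false: case (a)
for n in (100, 1000, 10_000, 100_000):
    print(n, "both true:", round(log_odds_m1(x_true, n), 2),
          " M1 false:", round(log_odds_m1(x_false, n), 1))
```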
2.4 Sensitivity to the prior
It is generally believed that sensitivity to prior specification is a bigger problem for comparing models than for prior-to-posterior inference within a model, and this belief has motivated various attempts to replace ‘true’ marginal likelihoods, or Bayes factors, with supposedly more ‘objective’ variants. However, I consider this conclusion to be seriously misleading. The full posterior distribution over models depends on both prior ingredients (between and within models) and on the data. We need to examine its relative sensitivity to all these inputs. We have already seen that the contribution of each of the prior ingredients remains of order 1 as n → ∞. In case (a) of §2.3 above, when comparing two models at least one of which is false, (6) shows that both ingredients of the prior are completely negligible in comparison with the two leading terms. Hence the choice of prior distribution, both between and within models, is utterly immaterial here. In case (b), of comparing two true models, the prior contributions are still negligible in comparison with the leading term, but now enter into the next, order 1, term in (7). There they are, however, intertwined, both with each other and with the ‘random’ term V, depending only on the data. (If M1 is a submodel of M2, then asymptotically V ∼ (1/2){χ²_{d2−d1} − (d2 − d1)}, by Wilks' Theorem.) Suppose now that M1 is the simplest true model, and M2 the next simplest, both supposed to exist and to be unique. The posterior probability of M1 is then seen to have the form
(8) p(M1 | X) ≈ 1 − k n^{−(1/2)(d2−d1)},

where the term k, of order 1, involves the order 1 terms in (7) — and so, in particular, the contributions of the prior. How does this degree of sensitivity to the prior compare with that of asymptotic within-model posterior distributions? Consider for simplicity the univariate case, and let θκ be the κ-quantile of the approximating N(θ̂, {−l″(θ̂)}^{−1/2}) posterior distribution of θ. Then it can be shown [Johnson, 1970] that the true posterior probability that θ ≤ θκ differs from κ by an expression with leading term of the form γ n^{−1/2}; here γ is an order 1 quantity involving the data and the first derivative of the log prior density, evaluated at θ̂. From this point of view, we see that the behaviour of the posterior probability distribution over models is in no way more sensitive to the choice of the prior than is the posterior distribution within a model; indeed, whenever d2 − d1 > 1, it is strictly less sensitive. (It is interesting, but of no great significance, that it is the actual value of the log prior density at θ̂ that affects the distribution over models, whereas it is its derivative there that affects the distribution within models. The point remains that in each case the posterior will be sensitive to the prior specification, to the order described.)

If we consider, as seems reasonable, the total posterior probability of all true models (assuming that there is at least one), we find that this behaves as 1 − e^{−nK}, with K the minimum, over all false models M, of the Kullback-Leibler distance K(Q, M). (Here we assume K > 0 — things become much more delicate if K = 0, as can happen when we entertain a countable collection of models.) Not only is the contribution of the prior specification completely negligible, but this probability approaches the ‘true’ value 1 at a much faster rate than that governing the convergence of within-model posterior probabilities.

It follows from the above considerations that we can determine a set of distributions for X whose posterior probability of containing the unknown true data-generating distribution differs from κ by only O(n^{−1/2}), as follows. First select the most probable model, then construct, within it, the usual approximate κ-region based on asymptotic normality. We do not even have to use the true posterior model probabilities; the wrong prior, or a crude approximation such as BIC, effectively the leading term of (7), will produce the same asymptotics.
3 IMPROPER PRIORS
The dependence of the posterior odds (7) on the prior specifications, though of smaller order than the dominant data-driven term, is nevertheless often of interest. We have

(9) log [p(M1 | x)/p(M2 | x)] = (1/2)(d2 − d1) log(n/(2πe)) + log Rp + V

with

(10) Rp := p(M1)ρ(θ1∗ | M1) / p(M2)ρ(θ2∗ | M2),
so that asymptotically the contribution of the priors is completely confined to the single quantity Rp. Correspondingly the Bayes factor, which is the ratio of posterior to prior odds, is governed by

(11) Rl := ρ(θ1∗ | M1) / ρ(θ2∗ | M2).

Each of these terms is highly sensitive to the ordinate of the invariantised prior density ρ at the maximum likelihood estimate, for each model. This has worried many workers who, seeking an ‘objective’ solution, would like to be able to use improper prior densities. However, as these are generally only specified up to an arbitrary scale factor, they do not determine a well-defined ordinate value. This has led to a variety of tricks and techniques, leading to revised definitions [Berger and Pericchi, 1996a; Berger and Pericchi, 1996b; Berger and Pericchi, 1997], some of which may even fail to follow the (prior-insensitive) leading term asymptotics of (7) [Aitkin, 1991; O'Hagan, 1995].

Now it seems a little strange to concentrate attention on the Bayes factor, or the related expression (11), since this incorporates just one of the two equally relevant ingredients of the prior. Instead we shall find it more useful and revealing to focus directly on the posterior odds, and on the associated constituent (10) of expression (9). For a generic model M and value θ of its parameter, let us write P for the particular data-distribution these determine, and define the measure ΓM over distributions in M by: dΓM(P) = p(M) p(θ | M) dθ. The overall prior distribution is thus

(12) Γ = ΣM ΓM,

which is a measure over the full model-space ℳ = {P : P ∈ M, some M}, the union of the various simple model-spaces under consideration. We also introduce

(13) γ(P) = γ(M, θ) := p(M) ρ(θ | M).

Confined to distributions P ∈ M, γ is thus the (invariantised) density of ΓM with respect to its Jeffreys measure µM. (Note that, although µM was originally introduced as a measure on the parameter-space for M, it can equally be regarded as defined directly over M itself, a space whose ‘points’ are probability distributions; we shall move between such equivalent representations of µM, ΓM, etc. without further comment.) As P ranges over ℳ, γ is the mixed discrete-continuous prior density of Γ with respect to the underlying measure

(14) µ := ΣM µM.
The measure µ is the simple unweighted sum of the Jeffreys measures on the various models, and is a well-defined σ-finite measure over ℳ. This is a ‘natural’ base measure to use in our model-mixture problem. We shall term µ the meta Jeffreys measure for the model collection ℳ. An important property of the meta Jeffreys measure µ is that its value, for any set, is a dimensionless real number. Consequently the density γ must also take such values, since its integral with respect to µ yields probabilities. In terms of the above notation, we have, simply,

(15) p(M1 | x)/p(M2 | x) = ∫ p(x | M1, θ) dΓM1(θ) / ∫ p(x | M2, θ) dΓM2(θ),

and

(16) Rp = γ(P∗M1)/γ(P∗M2),

where P∗M is the distribution which maximises the likelihood over model M. Now

(17) Γ(ℳ) = ΣM ΓM(M),

i.e.,

(18) ∫_ℳ γ(P) dµ(P) = ΣM ∫ γ(M, θ) dµM(θ).

For the case of a proper prior, Γ(ℳ) = 1, so that the various summands on the right-hand side (in which each integral is confined to a single model M) are finite and add to 1. We can, if we like, extract the implied prior model probabilities p(M) = ΓM(M) = ∫ γ(P) dµM(P), and the within-model-M invariantised prior density ρ(θ | M) = γ(M, θ)/p(M), but there is no necessity to do so.

An ‘improper’ prior distribution is a σ-finite measure Γ on ℳ, determined up to a single arbitrary scale factor, and such that Γ(ℳ) = ∞. In the cases of interest we will in fact have ΓM(M) = ∞ for some or all models M. The formal posterior odds is again given by (15), which is perfectly well-defined so long as both integrals in it converge, which will typically be the case so long as the sample size n is large enough. Likewise the quantity Rp which enters into the asymptotic expression (9) for the posterior odds is well-defined by (16). In contrast, we cannot sensibly define either the prior odds or the Bayes factor. For the prior odds could only be p(M1)/p(M2) = ΓM1(M1)/ΓM2(M2), both terms in which will typically be infinite, so that their ratio is meaningless. Correspondingly, we cannot normalise ΓM, for either model M, to obtain a meaningful probability distribution for θ conditional on M, and so cannot determine p(x | M) = ∫ p(x | M, θ) dΓM(θ), even up to a common scale factor. The same problem affects the expression (11) entering into the asymptotic Bayes factor. However, the problematic infinities cancel when we combine the ingredients to form the posterior odds (15), or the related expression (16).
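A one-line instance of the point (ours, not Dawid's): let M1 fix Xi ∼ N(0,1) (zero-dimensional, with ΓM1 taken as a unit point mass) and let M2 be N(θ,1) carrying the improper Lebesgue measure dθ. Neither the prior odds nor a Bayes factor is defined, but (15) gives finite posterior odds as soon as n ≥ 1, since ∫ ∏ φ(xi − θ) dθ converges; the integral has the closed form √(2π/n) exp(nx̄²/2) × ∏ φ(xi).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(0.0, 1.0, size=n)
xbar = x.mean()

# Formal posterior odds (15): p(M1|x)/p(M2|x)
#   = prod phi(x_i) / integral of prod phi(x_i - theta) d(theta)
#   = sqrt(n/(2*pi)) * exp(-n*xbar**2/2)     (the common factors cancel)
log_odds = 0.5 * np.log(n / (2 * np.pi)) - 0.5 * n * xbar ** 2
print("p(M1|x)/p(M2|x) =", np.exp(log_odds))
```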
This analysis suggests that many of the perceived difficulties of defining Bayes factors with improper priors may be due to an inappropriate formulation of the problem. If we concentrate attention instead on the posterior distribution across models, which is after all what we are interested in, then this can be perfectly well-defined, even when Bayes factors are not. This is exactly what we are used to for within-model inference, where we can readily determine a formal posterior odds that θ belongs to a set S, even though it may not be possible sensibly to define either the prior odds on this, or a relevant likelihood ratio. In this within-model case, we are happy to specify ‘ignorance’ by means of a measure, rather than a ‘likelihood’, over the full set of possible distributions. Why then should we try to do otherwise when the model is also uncertain?

4 SUBJECTIVE MODEL WEIGHTS
Suppose that we have decided that the prior measures ΓM should be of the form w(M) ΛM, for some specified (perhaps improper) σ-finite measures ΛM on the various models M. It remains to choose the prior weights w(M). This can be regarded as a ‘subjective/objective Bayes compromise’, in that we may be using ‘objective’ criteria to choose the within-model base prior measures ΛM, but allow subjective input into determining the weights. Note that the weights w(M) are not prior probabilities, unless the ΛM are proper and integrate to 1; nevertheless, they affect posterior odds in essentially the same way as do prior probabilities in the proper case. Note also that, if the values that the (ΛM) take are not dimensionless, then the weights w(M) will have to be appropriately dimensioned to adjust for this. For example, in a normal model for a length X, with unknown mean µ and standard deviation σ, both these parameters must also have the dimensions of length. The standard (right-invariant) improper prior measure Λ has density element dΛ(µ, σ) = σ^{−1} dµ dσ. If we change our units from, say, centimetres to inches, the values taken by Λ will scale in the same way: they themselves have the dimension of length. Consequently, any weight attached to Λ must have the dimensions of reciprocal length, in order to ensure that the resulting integrals are dimensionless.

Although the improper overall prior distribution does not assign finite ‘probabilities’ to the different models, it does so to bounded subsets of the model spaces. In particular, for two models M1 and M2, and bounded subsets A1 and A2 of their respective parameter spaces, we have:

(19) p(M1 holds and θ1 ∈ A1) / p(M2 holds and θ2 ∈ A2) = w(M1) ΛM1(A1) / w(M2) ΛM2(A2).
By adjusting the ratio w(M1 )/w(M2 ), we can thus ensure that this agrees with subjective assessments as to the relative plausibility of the two events “M1 holds and θ1 ∈ A1 ” and “M2 holds and θ2 ∈ A2 ”. In principle we could make many such assessments, varying the pairs of models and the sets in the spaces which we consider. However, we should strive to make
these self-consistent. If, for fixed models M1 and M2, we consider various different choices for the pair (A1, A2), we should want our assessments to agree with a common choice for the ratio w(M1)/w(M2). Further, for any three models M1, M2 and M3, we should require the separate assessments of their pairwise ratios to satisfy {w(M1)/w(M2)} × {w(M2)/w(M3)} = w(M1)/w(M3). While ensuring such self-consistency is a demanding task, it is not in principle different from any other case where we need to establish a prior distribution which will be a satisfactory approximation to real subjective beliefs. If it does not seem possible to accomplish this, it may be necessary to rethink the initial specifications of the ΛM.

A related idea is to consider the implied (typically still improper) predictive distributions for some future observation or observations Y. Thus we might address attention to comparisons of the form

(20) p(M1 holds and Y ∈ A1) / p(M2 holds and Y ∈ A2) = [w(M1) ∫ p(Y ∈ A1 | M1, θ1) dΛM1(θ1)] / [w(M2) ∫ p(Y ∈ A2 | M2, θ2) dΛM2(θ2)],

again adjusting the ratio w(M1)/w(M2) to bring about agreement with prior assessments (and again striving for self-consistency). This method has the advantage that the sets A1 and A2 belong to the same space. Indeed, the method will typically be most intuitive and useful when we choose A1 = A2 = A, say. In the limit, as A shrinks to a single point y, the left-hand side of (20) reduces to the posterior odds for comparing the two models consequent on observing Y = y:

(21) p(M1 | Y = y) / p(M2 | Y = y) = [w(M1) ∫ p(y | M1, θ1) dΛM1(θ1)] / [w(M2) ∫ p(y | M2, θ2) dΛM2(θ2)].

We may thus evaluate w(M1)/w(M2) by choosing a suitable hypothetical future observation Y = y and assessing the resulting posterior odds directly.
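As a toy instance of such an elicitation (our own construction): with M1 fixing Y ∼ N(0,1) and M2 the family N(θ,1) carrying Lebesgue measure dθ (so that ∫ p(y | M2, θ) dθ = 1 for every y), judging the posterior odds for M1 at a hypothetical observation y = 0 to be 3:1 pins down the weight ratio via (21).

```python
from scipy.stats import norm

y, assessed_odds = 0.0, 3.0                 # hypothetical future datum and judged odds
m1 = norm.pdf(y, 0.0, 1.0)                  # p(y | M1)
m2 = 1.0                                    # integral of N(y; theta, 1) d(theta) over R
print("w(M1)/w(M2) =", assessed_odds * m2 / m1)   # solve (21) for the weight ratio
```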
5 ‘FULLY OBJECTIVE’ FORMAL PRIORS

Suppose now that we wish to exclude all prior input, and so choose the weights, as well as the within-model measures, by purely formal means. That is, we seek a ‘reference’ measure Γref, over the union ℳ of all the models under consideration. Note that we use the terms “reference prior” or “reference measure” in a generic sense, rather than in the more specialised technical sense of Berger and Bernardo [Bernardo, 1979; Berger and Bernardo, 1996].
5.1 Choice of reference measure
Let γref denote the invariantised density of Γref, with respect to the meta Jeffreys measure µ. How might we select γref? Recall that any such density takes values which are pure dimensionless numbers. It is difficult to conceive of any argument that could justify taking γref(M, θ) ≠ γref(M, θ′), for two parameter-values θ, θ′
in the same model M. Consequently we shall restrict attention to the case that γref(M, θ) is a constant, wref(M) say, for all such θ. This corresponds to using a prior measure of the form

(22) Γref = ΣM wref(M) µM,

with wref(M) serving to weight the Jeffreys prior µM for model M. The formal posterior model probabilities will then be

(23) p(M | x) ∝ wref(M) p̄M(x)

where

(24) p̄M(x) := ∫ p(x | M, θ) dµM(θ)

is the (typically improper) ‘reference marginal density’ of X under M. By (9) and (16), we have the asymptotic behaviour

(25) log [p(M1 | x)/p(M2 | x)] = (1/2)(d2 − d1) log(n/(2πe)) + κ + V

when both models M1 and M2 hold, where now κ = log{wref(M1)/wref(M2)}.
5.2 Reference model probabilities
How should we set the weights wref(M) — recalling that these are dimensionless real numbers? The simplest proposal is just to take equal weights, wref(M) ≡ 1 for all M, equivalent to using the meta Jeffreys measure µ itself as the reference prior distribution. This leads to the formula

(26) pref(M | x) ∝ p̄M(x),

with p̄M(x) given by (24), µM being the Jeffreys measure on model M. We shall (somewhat arbitrarily) fix on this as our basic definition, thus calling posterior probabilities constructed according to (26) the reference posterior model probabilities. Note that, with this choice, we have κ = 0 in (25).
5.3 Adjusting the weights
An important property of the reference posterior distribution (26) over models is that it is genuinely easy to use it for reference, rather than as a solution in itself. If, instead of the implicit weights wref(M) ≡ 1 in (22), we wish to use alternative weights w(M), we can readily adjust the reference probabilities:

(27) p(M | x) ∝ w(M) p̄M(x)
(28)          ∝ w(M) pref(M | x).
We have already discussed subjective methods of assessing the w(M). We can also consider ‘formal’ choices other than w ≡ 1. For example, it might be thought appropriate to make a ‘correction for dimensionality’, so taking the weights to have the form

(29) w(M) = g(dM)

for some function g(·), where dM is the dimensionality of model M. An appealing choice for the function g is

(30) g(d) = k^d

for some dimensionless constant k. This leads to

(31) p(M | x) ∝ k^{dM} p̄M(x),

with asymptotics

(32) log [p(M1 | x)/p(M2 | x)] = (1/2)(d2 − d1) log(n/(2πk²e)) + V.

The choice k = (2π)^{−1/2} has been suggested by Wasserman [2000] as that producing closest agreement with the Jeffreys-Schwarz Bayesian Information Criterion (BIC), which is itself based on the approximation −(1/2) d log n for A in (??.1) (compare (??.9)). However, there seems to be no compelling reason to regard one choice rather than another for the omitted O(1) terms as more ‘objective’; and, as always when considering sensitivity to the prior, in (25) the effect κ of the choice of weights is asymptotically unimportant in comparison with the driving term (1/2)(d2 − d1) log n (at any rate, for comparing models of different dimensions).
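Operationally, the adjustments (27)-(28) and (31) amount to a re-weighting and renormalisation, as in this small sketch of ours (the reference probabilities and model dimensions are invented for illustration).

```python
import numpy as np

p_ref = np.array([0.55, 0.30, 0.15])   # hypothetical reference posterior probabilities
dims = np.array([0, 1, 2])             # model dimensions d_M
k = (2 * np.pi) ** -0.5                # the value suggested by Wasserman [2000]
p = k ** dims * p_ref                  # p(M|x) proportional to k^{d_M} * p_ref(M|x)
print(p / p.sum())                     # renormalised posterior model probabilities
```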
5.4 Reference marginal likelihood
Although the prior weights w(M) in (27) are not prior probabilities, they behave much as if they were, and can be given a similar interpretation. Likewise, although \bar{p}_M(x) is not a proper marginal likelihood for M, it too behaves much as if it were. Thus we may reasonably refer to \bar{L}(M) \propto \bar{p}_M(x) as the reference marginal likelihood over models M, based on data x. Now suppose that we have a single parametric model M, with parameter θ, say, and are interested in a function φ of θ. Each hypothetical value for φ defines a submodel M_φ of M, and so we can use the above analysis to construct a reference marginal likelihood for φ. This is thus defined by:

(33) \bar{L}(\phi) \propto \int p(x \mid \theta)\, d\mu(\theta \mid \phi),

where µ(· | φ) is the Jeffreys measure over the model M_φ. (In forming µ(· | φ), we must not, of course, discard any ‘proportionality constant’ that depends on φ.) Then (33) is fully invariant under reparametrisation.
How are we to use such a marginal likelihood? Naturally, by combining it with a set of prior weights w(M ), according to (27). But we must be careful in assigning these weights, taking into account that they should not be interpreted as probabilities. The suggestions of §4 and §5.3 could be useful here, but the problem of interpretation of the weights deserves still further attention.
5.5 Invariant and non-invariant priors
The absolute invariance of the Jeffreys measure ensures that it is dimensionless. In particular, and essential for our purposes, it is not plagued by the problem of an unspecified multiplier: twice the Jeffreys measure is not a Jeffreys measure. In this respect at least, even though it may not be normalisable, it behaves like a proper probability distribution.

Consider first the model, for an observable X, which is normal with unknown mean θ and known variance σ². The Jeffreys measure element for this model is

(34) \sigma^{-1}\, d\theta,

which is dimensionless, both θ and σ being measured in the same units as X. This density element is often described simply as “proportional to dθ”, with the scaling constant σ^{-1} omitted, but this is misleadingly imprecise; the problem of determining an appropriate overall scale to apply to the measure element dθ would then be exacerbated by the need to use a divisor with the same dimensions as X — a constraint which is all too rarely appreciated. If we now suppose the variance to be unknown also, the Jeffreys measure element becomes

(35) \sqrt{2}\, d\theta\, d\sigma / \sigma^2,

which is again (as it must be) dimensionless. Note also the fully determined numerical multiplier √2. Other location or location-scale families would yield Jeffreys measures essentially the same as (34) or (35), but with possibly different numerical multipliers.

We can decompose the bivariate Jeffreys measure √2 dθ dσ/σ² in a natural way, as the product of the (uniquely defined) conditional Jeffreys measure σ^{-1} dθ for θ given σ, and the remaining term √2 dσ/σ, which term may thus be thought of as defining the ‘marginal Jeffreys measure’ for σ; note that the expression √2 (∫ dθ) dσ/σ², which would formally define such marginalisation, involves an infinite integral, and is thus meaningless. This should be contrasted with the more ‘obvious’, but non-invariant, decomposition into dθ and √2 dσ/σ². (We remark, however, that such ‘natural’ decomposition of the Jeffreys measure is not always possible. Even in the normal case, we cannot decompose the bivariate Jeffreys measure into the conditional Jeffreys measure for σ given θ, which is √2 dσ/σ irrespective of θ, and any marginal measure for θ.)

By contrast, other improper prior distributions (e.g. right Haar measure, Berger–Bernardo reference priors, . . . ) which have been suggested — even though they
may be ‘relatively invariant’ under scale or other relevant transformations — are typically not fully invariant. In particular, they are not dimensionless, and so cannot come ready supplied with a natural overall scale. For instance, the improper prior commonly suggested for the above two-parameter normal model is that with element “proportional to dθ dσ/σ”. However, we recall that dθ dσ/σ has the same dimensions as X. Without any further quantity in the problem which could set a yardstick for defining ‘unit length’, there is nothing we can use to eliminate this dimensional dependence. Correspondingly, there can be no universal formal prescription for choosing the arbitrary multiplier, since this has to be a dimensioned quantity. Likewise, there can be no natural decomposition of such a bivariate measure into (say) a conditional measure for θ given σ and a marginal measure for σ.

It is for the above reasons that we do not consider it as generally appropriate to use other improper priors than the Jeffreys measure for purposes of ‘fully objective’ formal model comparison. There may be some cases where this constraint can be relaxed. For example, if all our models are location-scale models, with location parameter θ and scale parameter σ, we might choose to use the common measure k dθ dσ/σ, where the choice of the (dimensioned) constant k will not affect model comparisons. (Even here, there is no obvious reason not to allow the scaling constant k to vary across such models, similar to the case for the Jeffreys prior.) In some other problems there may be, in each model M, a function of its parameter θ_M which corresponds to a common externally defined quantity φ. One might then combine an essentially arbitrary common measure for φ in each model with a suitable reference measure for θ_M given φ — which, for the reasons discussed above, will generally be the conditional Jeffreys measure, and which may in any case incorporate a ‘proportionality constant’ depending on φ.

6 PROBLEMS WITH THE JEFFREYS MEASURE
There are nevertheless some difficulties associated with the use of the Jeffreys measure.
6.1 Problems of sample size
We have concentrated attention on problems where the observations are independent and identically distributed. In this case, the Jeffreys measure appears well determined, independent of the sample size. Even in this special problem, however, this is not really the case, as the following example illustrates.

Example. Consider the model:

(36) M_1: X_i \sim N(\theta, \sigma^2), \quad i = 1, \ldots, n,

independently, σ² being known. We wish to compare this with an alternative model M2, which postulates θ = 0 in (36). Suppose first that n = 1. The Jeffreys prior for M1, σ^{-1} dθ, leads to \bar{p}_1(x) = σ^{-1}, for any x, and so the reference posterior odds in favour of M2 is

(37) \Omega := \frac{p(x \mid \theta = 0)}{\bar{p}_1(x)} = (2\pi)^{-1/2} \exp\{-x^2 / 2\sigma^2\}.

If we now have a sample of size n, it seems that we should, without loss, be able to reduce the data to the single sufficient statistic \bar{X} \sim N(\theta, \tau^2), where τ² = σ²/n, and where again θ = 0 under M2. Applying (37) to this now 1-dimensional case would then yield the reference odds

\Omega_n := (2\pi)^{-1/2} \exp\{-n \bar{x}^2 / 2\sigma^2\}.

Now, under M2, n\bar{X}^2/\sigma^2 \sim \chi^2_1, for any n. So the distribution of Ω_n does not depend on n. This result also follows directly from the complete invariance of the problem under a change of scale, which is exactly the effect that increasing the sample size n has on the model for \bar{X}. However, this seems to contradict (25), which would require log Ω_n = ½ log n + O_p(1). How can this be?

The above result does not require the initial reduction by sufficiency, and would also ensue if we used the model for all the data — together with the Jeffreys prior for this model. And there’s the rub: the Jeffreys prior based on the model for (X_1, \ldots, X_n) (or, equivalently, on that for \bar{X}) is not independent of n: it is in fact n^{1/2} \sigma^{-1}\, d\theta. And, if we change the prior with n in this drastic way as the data accrue, of course we will upset the asymptotics (which assume a fixed prior). So we need to find a way to adjust for this effect of sample size on the Jeffreys measure: for example, by basing it only on the model for a single observation — always supposing we can give meaning to this phrase.

In general, the Jeffreys measure µ^s_M associated with a sequence of s independent and identically distributed observations from a model M, of dimension d_M, satisfies

(38) \mu^s_M = s^{d_M/2}\, \mu_M.

Now if our observables (X_i : i = 1, \ldots, n) form a random sample of size n, we can just as well think of them as a random sample (Y_j : j = 1, \ldots, n′), of size n′ = n/s, of ‘clusters’, with Y_1 denoting the cluster (X_1, \ldots, X_s), etc. From this point of view, the appropriate Jeffreys measure would appear to be that associated with the model for a single cluster, i.e. µ^s_M, rather than µ_M. When we are comparing models of differing dimensions, the choice of s will affect the definition of our base measure; our previous construction now yields, rather than µ or anything proportional to it, a different meta Jeffreys measure \mu^s := \sum_M \mu^s_M = \sum_M s^{d_M/2} \mu_M. This change will in turn affect any proposed formal construction of a reference prior (such as the prescription: take γ_ref ≡ 1). Note that the density γ of µ^s with respect to µ is of the form (30), which provides additional interpretation of, and justification for, that family of formal densities. Now we have
(39) \mu^s = \sum_M \left(\frac{s}{n}\right)^{d_M/2} \mu^*_M,

where \mu^*_M = \mu^n_M denotes the Jeffreys measure based on the model M for all the data. Thus, to construct the meta Jeffreys measure relevant to taking observations in clusters of size s, we could first find the \{\mu^*_M\}, and then apply (39). This is a better way of thinking of things, since the definition of µ* is clear-cut, independent of any arbitrary choice as to cluster size. We can also extend (39), without difficulty, to apply to models in which the variables need not be taken as independent and identically distributed. There remains the choice of cluster size s. The ‘natural’ choice is s = 1, viz.

(40) \mu^1 = \sum_M n^{-d_M/2}\, \mu^*_M,
which, in the independent and identically distributed case, recovers the original definition (14), and can also be applied more generally. We shall fix on this choice as our general reference prior distribution, and again refer to formal posterior model probabilities constructed from it as the reference posterior model probabilities, p_ref(M | x). These are thus given by:

(41) p_{\mathrm{ref}}(M \mid x) \propto n^{-d_M/2} \int p(x \mid M, \theta)\, d\mu^*_M(\theta),
with µ*_M the Jeffreys prior, for model M, based on all the data. Even this definition is not unproblematic, however. In more complex problems, such as hierarchical models, it may be far from clear how the ‘sample size’ n should be defined. This problem is related to, but distinct from, that of defining the dimensionality d_M in such problems [Pauler, 1998]. Even in the independent and identically distributed case, the definition of n depends on the identification of “a cluster of size one”, which is itself not wholly unambiguous. More difficulties arise when the sample size, instead of being fixed in advance, is determined by the application of a stopping rule to sequentially generated data: this complicates both the interpretation of n and the appropriate definition of the Jeffreys measure. See §7.2 below.
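The sample-size effect in the Example of this subsection is easy to check by simulation. The following Python sketch (an added illustration; σ = 1 and the Monte Carlo settings are arbitrary choices) draws X̄ under M2 for several sample sizes and confirms that the distribution of Ω_n, obtained by applying (37) to the sufficient statistic, is the same for every n, just as the scale-invariance argument predicts.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def omega_n(n, reps=100_000):
    """Reference odds (2π)^{-1/2} exp(-n x̄²/2σ²) for M2: θ = 0,
    computed from data generated under M2 itself."""
    xbar = rng.normal(0.0, sigma / np.sqrt(n), size=reps)   # X̄ ~ N(0, σ²/n)
    return (2 * np.pi) ** -0.5 * np.exp(-n * xbar**2 / (2 * sigma**2))

# The deciles of Ω_n agree (up to Monte Carlo error) for every n, rather
# than drifting upwards at the rate ½ log n that (25) would suggest.
for n in (1, 10, 1000):
    q = np.quantile(omega_n(n), [0.1, 0.5, 0.9])
    print(f"n={n:>5}: deciles of Omega_n = {np.round(q, 4)}")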
6.2 Marginal likelihood
Consider again the normal model of (36), with now both θ and σ unknown. The Jeffreys prior for θ given σ has element dµ(θ | σ) = σ^{-1} dθ, so that (33) gives the following reference marginal likelihood for σ:

(42) \bar{L}(\sigma) \propto \sigma^{-n}\, e^{-R/2\sigma^2},
with R the usual residual sum-of-squares, \sum_i (x_i - \bar{x})^2. This is identical to the ‘profile likelihood’ [Kalbfleisch, 1986], and suffers from the same problems: for example, its maximum is at \hat{\sigma}^2 = R/n, and so it does not take account of the
reduction in the degrees of freedom from estimating θ. This equivalence to profile likelihood also holds for more general normal regression problems, where the ‘incorrect’ degrees of freedom can be more bothersome; and extends to problems such as that of Neyman and Scott (Example 3 of Berger and Bernardo [1996]), where the number of mean parameters increases with the sample size, leading to inconsistent estimation of σ². Another difficulty with reference marginal likelihood is its susceptibility to marginalisation paradoxes [Dawid et al., 1973]. (We recall that it is not in general appropriate to combine the reference marginal likelihood (42) with a prior probability distribution over φ, but rather with a weight distribution. However, this does not affect the logic of the marginalisation paradox.) In particular problems we may be able to justify integrating out the unwanted components of the parameter θ with respect to some non-Jeffreys conditional distribution for θ given φ, the parameter of interest. This might, at its simplest, involve using a different choice of ‘normalising constant’. In the above example, we could consider such adjusted priors as k(σ/σ_0) σ_0^{-1} dθ, where σ_0 is a ‘natural scale’ in the problem, having the same dimensions as X, and k(·) is some dimensionless positive real function of a dimensionless real argument. For the special case that k has the form k(x) ≡ x^α, for some real α, the choice of σ_0 will not affect the analysis. This would then allow other popular improper priors, for example right-Haar measure, or Berger–Bernardo reference priors, some of which might exhibit better behaviour. We remark, however, that in general it is not possible to eliminate all marginalisation paradoxes in this way.
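The degrees-of-freedom point can be verified numerically. The sketch below (an illustration added here, with simulated data and arbitrary parameter values) maximises the reference marginal likelihood (42) over a grid and recovers σ̂² = R/n rather than the corrected R/(n − 1).

import numpy as np

rng = np.random.default_rng(1)
n, theta, sigma = 20, 5.0, 2.0
x = rng.normal(theta, sigma, size=n)
R = np.sum((x - x.mean()) ** 2)                 # residual sum of squares

def log_ref_marginal(s):
    """log of (42): L-bar(σ) ∝ σ^{-n} exp(-R/2σ²)."""
    return -n * np.log(s) - R / (2 * s**2)

grid = np.linspace(0.5, 5.0, 20_000)
s_hat = grid[np.argmax(log_ref_marginal(grid))]

print(f"argmax of (42): sigma-hat^2 = {s_hat**2:.4f}")
print(f"R/n     = {R/n:.4f}    (matches: no degrees-of-freedom correction)")
print(f"R/(n-1) = {R/(n-1):.4f} (the usual unbiased estimate of sigma^2)")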
6.3 Proper Jeffreys priors
In some problems, the Jeffreys measure turns out to be finite. However, that does not mean that it is a probability distribution, since its total mass need not be 1. Consider, for example, the case of a Binomial random variable X, with index n and unknown probability parameter θ ∈ (0, 1). The Jeffreys measure (reduced to n = 1) has density element θ^{-1/2}(1 − θ)^{-1/2} dθ, and thus total mass π; it is proportional to, but not identical with, the Beta distribution β(½, ½). Should we use the actual Jeffreys measure in this case, or replace it by the associated probability distribution? I would argue that, for the purposes of ‘objective’ model comparison, there is nothing to be gained by rescaling to unit mass (a possibility that does not even exist for most models), and that the actual Jeffreys measure should be used in (24), (26), etc.
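The total mass π is immediate from the Beta integral, as the following one-line Python check (added here for illustration) confirms:

from math import gamma, pi

# Total mass of the Jeffreys element θ^{-1/2}(1-θ)^{-1/2} dθ on (0,1)
# is the Beta function B(1/2, 1/2) = Γ(1/2)² / Γ(1) = π:
mass = gamma(0.5) ** 2 / gamma(1.0)
print(mass, pi)        # both print 3.141592653589793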
6.4 Dilution and vagueness
The problem of ‘dilution’ [George, 1999] arises when similar or identical models are present in our model collection M. For example, model M1 might express the observables (Xi ) as jointly normal, with common mean, variance, and correlation (between any two distinct variables); M2 might model them as Xi = Z + Yi , with Z and the (Yi ) independent normal, the (Yi ) being identically distributed. These
two models induce exactly the same family of distributions for the observables. If they are both included in M, it would, in effect, lead to ‘double counting’. Cases with similar rather than identical models arise, say, when we consider which independent variables to include in a regression problem with highly correlated regressors. The dilution problem is not confined to use of the Jeffreys measure, but can affect any ‘automatic’ method of assigning posterior probabilities over models. Suitable additional adjustment for this effect thus needs to be made.

A specific difficulty with the Jeffreys measure arises if we try to allow for some ‘vagueness’ in a model. Thus, instead of a model M under which the observables X_i are independent and identically normally distributed with unknown mean θ and known standard deviation σ = 1, we might consider M′, under which σ is allowed to range between 0.99 and 1.01. Under M the Jeffreys measure element for θ is just dθ; but under M′, which could reasonably be expected to have higher weight, it is 0.028 dθ (where 0.028 = \int_{0.99}^{1.01} \sqrt{2}\, d\sigma/\sigma^2). Again, some further adjustment is required.
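The quoted mass is a one-line computation (an added numerical check):

from math import sqrt

# Mass that the bivariate Jeffreys element √2 dθ dσ/σ² assigns, per unit dθ,
# to the 'vague' band 0.99 < σ < 1.01: the antiderivative of √2 σ^{-2} is -√2/σ.
mass = sqrt(2) * (1 / 0.99 - 1 / 1.01)
print(round(mass, 4))       # 0.0283, the 0.028 dθ element quoted above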
7 EXAMPLES
We now present some simple examples to help clarify some of the properties — both good and bad — of reference posterior probabilities.
7.1 Regression model
Consider a normal regression model M for observable Y:

(43) Y \sim N(X\theta,\, h^{-1}\Sigma),

where Y is (n × 1), θ is (p × 1), and the precision h is unknown, along with θ, whereas Σ, of full rank, is known (e.g. Σ = I_n). Thus the dimensionality of this model is d = p + 1. It is readily verified that the Jeffreys measure µ*, based on the model for all the data, has density element

(44) d\mu^* = \left(\frac{n}{2}\right)^{1/2} \{\det(X^T \Sigma^{-1} X)\}^{1/2}\, d\theta\; h^{\frac{1}{2}p - 1}\, dh.

We note that formula (44) is invariant under non-singular linear transformations both of response and of predictor variables, as well as under rescaling of Σ. Adjusted to a sample size of 1, we thus get Jeffreys measure

(45) d\mu^1 = \left(\frac{1}{n}\right)^{\frac{1}{2}(p+1)} d\mu^* = 2^{-1/2} (\det A)^{1/2}\, d\theta\; h^{\frac{1}{2}p - 1}\, dh,

where A := n^{-1} X^T \Sigma^{-1} X will typically be of order 1 as n increases. Using µ¹ in (24) yields
(46) \bar{p}(y) = c_n \left(\frac{2\pi}{n}\right)^{\frac{1}{2}p} (\det \Sigma)^{-\frac{1}{2}}\, R^{-\frac{1}{2}n},

where c_n = \Gamma(\frac{1}{2}n)/\sqrt{2\pi^n} is the same for all such regression models, and R := y^T\{\Sigma^{-1} - \Sigma^{-1}X(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}\}y is the residual sum of squares. For this to be finite (although still not normalisable), we require R > 0, and thus n > rank X. From (46) (or equivalently directly from (41)), we then obtain the reference posterior model probabilities:

(47) p(M \mid y) \propto \left(\frac{2\pi}{n}\right)^{\frac{1}{2}p_M} (\det \Sigma_M)^{-\frac{1}{2}}\, R_M^{-\frac{1}{2}n}.

For comparing two regression models with the same dispersion structure Σ but different mean-structure, the reference posterior odds is given by

(48) \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \left(\frac{2\pi}{n}\right)^{\frac{1}{2}(p_1 - p_2)} \left(\frac{R_2}{R_1}\right)^{\frac{1}{2}n}.
If M2 is a submodel of M1, we can rewrite this in terms of the F-statistic for testing M2 within M1, using

\frac{R_2}{R_1} = 1 + \frac{(p_1 - p_2)}{(n - p_1)}\, F.
When, further, both models are true, F will be bounded in probability, having a limiting \chi^2_{p_1 - p_2}/(p_1 - p_2) distribution. Then (R_2/R_1)^{\frac{1}{2}n} \sim \exp\{\frac{1}{2}(p_1 - p_2)\, F\} as n → ∞, and the agreement with the asymptotic form (25), with κ = 0, is readily confirmed.

Known precision

A similar analysis may be conducted in the simpler case that the dispersion matrix of Y is fully specified in each model, so that we can take h = 1. In this case, for p > 0 the appropriate Jeffreys measure is (\det A)^{1/2}\, d\theta, and the reference marginal density becomes

(49) \bar{p}(y) = \left(\frac{2\pi}{n}\right)^{\frac{1}{2}p} (\det 2\pi\Sigma)^{-\frac{1}{2}}\, e^{-\frac{1}{2}R}.

No free parameters

An interesting feature now arises when we consider the case p = 0, so that the model specifies a completely fixed distribution, P_0 say, with no unknown parameters at all. The standard routine for determining the Jeffreys measure is not now applicable. Clearly, any measure contemplated must confine its mass to P_0 — but what mass should it put there? The answer “1” appears natural and desirable, but by no means obviously correct (particularly in the light of §6.3). Now if we formally put p = 0 in (49), we obtain
(50) \bar{p}(y) = (\det 2\pi\Sigma)^{-\frac{1}{2}}\, e^{-\frac{1}{2}R}.

We see that this is, not merely proportional to, but identical with, the joint density p_0(y) of Y under P_0. That is, it corresponds to integrating over the ‘model’ {P_0} with respect to the measure putting mass 1 on P_0. This roundabout argument thus justifies taking the appropriate Jeffreys measure in this case to be the degenerate probability measure at P_0. Arguing by analogy (perhaps a dangerous enterprise), we might propose more generally that, for any model M_0 which fully specifies a single distribution P_0 for the data, the associated ‘Jeffreys measure’ be taken as the unit mass on that distribution, thus yielding \bar{p}_{M_0}(x) \equiv p_0(x) in (24), and, in (41), p_{\mathrm{ref}}(M_0 \mid x) \propto p_0(x).
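As a numerical illustration of (48) (added here; the design, sample size and random seed are arbitrary choices), the following Python sketch fits two nested mean-structures by least squares and evaluates the reference posterior odds directly:

import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1.0 + rng.normal(size=n)        # data generated under the smaller model M2

def rss(X, y):
    """Residual sum of squares after least-squares projection onto col(X)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X1 = np.column_stack([np.ones(n), x])   # M1: intercept + slope, p1 = 2
X2 = np.ones((n, 1))                    # M2: intercept only,    p2 = 1
R1, R2 = rss(X1, y), rss(X2, y)
p1, p2 = 2, 1

# Reference posterior odds (48), with the same dispersion structure (Σ = I):
odds = (2 * np.pi / n) ** (0.5 * (p1 - p2)) * (R2 / R1) ** (0.5 * n)
print(f"reference posterior odds M1:M2 = {odds:.3f}")   # typically < 1 here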
7.2 ESP Experiment
In an experiment to investigate the existence of extra-sensory perception [Jahn et al., 1987; Jefferys, 1990], a sequence of n = 104,490,000 trials, taken as Bernoulli with parameter θ, resulted in a data-sequence x containing r = 52,263,471 successes. The maximum likelihood estimate of θ is θ̂ = 0.500177. The two models considered are M1: 0 < θ < 1, with dimension d1 = 1, and M2: θ = 0.5, with dimension d2 = 0 (the null hypothesis of no ESP). Although θ̂ appears very close to the null value, with this huge sample size a traditional hypothesis test leads to rejection of the null hypothesis at conventional levels of significance: the approximate standard normal deviate is 3.614, the corresponding two-tailed P-value being less than 0.0003.

Binomial sampling

We first take the ‘natural’ approach of assuming that the total number n of trials was fixed in advance, so that r is an observation from the binomial distribution Bin(n; θ). The corresponding unadjusted Jeffreys measure element under M1, based on the full data x, is

(51) d\mu^*(\theta) = n^{1/2}\, \theta^{-1/2}(1 - \theta)^{-1/2}\, d\theta,

and when standardised to n = 1 becomes

(52) d\mu^1(\theta) = \theta^{-1/2}(1 - \theta)^{-1/2}\, d\theta.

As suggested by §7.1, we take the Jeffreys measure over M2 to be the unit mass. Then application of (41) (using Stirling’s approximation to the Gamma integral under M1) leads to reference posterior odds

(53) \frac{p_{\mathrm{ref}}(M_2)}{p_{\mathrm{ref}}(M_1)} = n^{1/2}\, \frac{p_2(x)}{\int p_1(x \mid \theta)\, d\mu^*_1(\theta)} = \frac{p_2(x)}{\int p_1(x \mid \theta)\, d\mu^1(\theta)}

(54) \phantom{\frac{p_{\mathrm{ref}}(M_2)}{p_{\mathrm{ref}}(M_1)}} = 5.948,
expressing a preference for the null hypothesis M2. This is in general accordance with other Bayesian analyses of this problem.

Note that the meta Jeffreys measure for this problem (for n = 1) has element \{\delta(\theta - \tfrac{1}{2}) + \theta^{-1/2}(1 - \theta)^{-1/2}\}\, d\theta, where δ denotes the Dirac delta-function. This has finite total mass 1 + π, and is thus proportional to the proper prior probability structure with

p(M_1) = \pi/(1 + \pi), \quad p(M_2) = 1/(1 + \pi), \quad \theta \mid M_1 \sim \beta(\tfrac{1}{2}, \tfrac{1}{2}).

Thus in this particular case use of this proper ‘reference prior structure’ would give the same reference posterior odds (54).

Negative Binomial sampling

Now suppose that it was r, rather than n, that was fixed in advance, so that n is regarded as an observation from the negative binomial distribution NegBin(r; θ). This affects the Jeffreys measure element which, unadjusted, is now

(55) d\mu^*_1(\theta) = r^{1/2}\, \theta^{-1}(1 - \theta)^{-1/2}\, d\theta.

Although (55) differs from (51), we see that they are essentially identical in the region of θ = θ̂, i.e. where the likelihood is non-negligible. (This is a general feature, at least for exponential families: the Fisher information evaluated at the maximum likelihood estimate is always the same as the observed information, and this latter is not affected by the stopping rule.) Consequently the integral in (41) is effectively unchanged — in fact, the reference odds, calculated exactly as in (53), is now 5.951. (However, because (55) is not normalisable, there is now no corresponding proper ‘reference prior structure’.) We see then that the well-known dependence of the Jeffreys prior on the stopping rule, which might seem prima facie to be problematic, is of no essential practical importance in our approach.

There is, however, another, quite distinct problem that arises: how should we specify the appropriate scaling factor under M1? After all, we now have observed, not n independent Bernoulli trials, but r independent geometric trials. Slavish application of our general recommendations would appear to imply that we should scale (55) by r^{-1/2} rather than by n^{-1/2} as in the binomial analysis above, so renormalising it to a single observation which is, however, now geometric rather than Bernoulli. If we did so, the reference posterior odds would be multiplied by θ̂^{1/2} = 0.707, leading to a value of 4.21, corresponding to (slightly) weaker evidence against ESP. Common sense, however, suggests that the stopping rule should not have such an arbitrary influence on the answer, and that we should continue to scale (55) by n^{-1/2} — even though the ‘sample size’ n is data-dependent.
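Both quoted odds can be reproduced directly from the Beta integrals, with no need for Stirling's approximation. The following Python sketch is an added illustration; the formulae follow from (41), (52) and (55), the likelihood of the observed sequence being θ^r (1 − θ)^{n−r} under either stopping rule, so that any combinatorial factor cancels between the two models.

from math import lgamma, log, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

n, r = 104_490_000, 52_263_471

# Binomial sampling, (53): odds = p2(x) / integral of p1(x|θ) dµ¹(θ).
# The integral against θ^(-1/2)(1-θ)^(-1/2) dθ is B(r + 1/2, n - r + 1/2).
log_odds_bin = n * log(0.5) - log_beta(r + 0.5, n - r + 0.5)
print(exp(log_odds_bin))     # approx 5.95, the 5.948 quoted in the text

# Negative binomial sampling, using (55) scaled by n^(-1/2) as recommended:
# odds = p2(x) / {(r/n)^(1/2) B(r, n - r + 1/2)}.
log_odds_nb = n * log(0.5) - 0.5 * log(r / n) - log_beta(r, n - r + 0.5)
print(exp(log_odds_nb))      # approx 5.95, the 5.951 quoted in the text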
8 CONCLUSION
We have indicated how some, at least, of the difficulties of conducting ‘reference’ analyses, when the model is uncertain, can be eliminated by concentrating attention directly on posterior model probabilities, rather than on marginal likelihood or Bayes factors. A tentative case has been made for the use of the ‘meta Jeffreys’ prior and the associated reference posterior model probabilities. However, we have also pointed out a number of difficulties associated with this proposal. Further work will be needed to try to produce a general method which shares the desirable properties (invariance, etc.) of the meta Jeffreys prior, while avoiding its undesirable features (in particular, its problematic dependence on sample size and stopping rule). It remains to be seen whether these desiderata can be made mutually compatible.

ACKNOWLEDGMENTS

I am grateful to Merlise Clyde, Susie Bayarri and Bertrand Clarke for valuable discussions on the material in this paper, to José Bernardo for suggesting that I analyse the example of §7.2, and to an anonymous referee for helpful suggestions.

BIBLIOGRAPHY

[Aitkin, 1991] Murray Aitkin. Posterior Bayes factors (with Discussion). Journal of the Royal Statistical Society, Series B, 53:111–142, 1991.
[Berger and Bernardo, 1996] James O. Berger and José-Miguel Bernardo. On the development of reference priors (with Discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4, pages 35–60, Oxford, 1996. Clarendon Press.
[Berger and Pericchi, 1996a] James O. Berger and Luis Raúl Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91:109–122, 1996.
[Berger and Pericchi, 1996b] James O. Berger and Luis Raúl Pericchi. The intrinsic Bayes factors for linear models (with Discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 5, pages 25–44, Oxford, 1996. Clarendon Press.
[Berger and Pericchi, 1997] James O. Berger and Luis Raúl Pericchi. On criticisms and comparisons of default Bayes factors for model selection and hypothesis testing. Technical Report 97-43, Institute of Statistics and Decision Sciences, Duke University, 1997.
[Bernardo, 1979] José-Miguel Bernardo. Reference posterior distributions for Bayesian inference (with Discussion). Journal of the Royal Statistical Society, Series B, 41:113–147, 1979.
[Clarke and Barron, 1990] Bertrand S. Clarke and Andrew R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36:453–471, 1990.
[Clyde, 1999] M. A. Clyde. Model averaging and model search (with Discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 157–185, Oxford, 1999. Clarendon Press.
[Dawid et al., 1973] Alexander Philip Dawid, Mervyn Stone, and James Victor Zidek. Marginalization paradoxes in Bayesian and structural inference (with Discussion). Journal of the Royal Statistical Society, Series B, 35:189–233, 1973.
[Dawid, 1992] Alexander Philip Dawid. Prequential analysis, stochastic complexity and Bayesian inference (with Discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4, pages 109–125, Oxford, 1992. Clarendon Press.
[George, 1999] Edward I. George. Discussion of Clyde [1999]. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 175–177, Oxford, 1999. Clarendon Press.
[Jahn et al., 1987] R. G. Jahn, B. J. Dunne, and R. D. Nelson. Engineering anomalies research. Journal of Scientific Exploration, 1:21–50, 1987.
[Jefferys, 1990] William H. Jefferys. Bayesian analysis of random event generator data. Journal of Scientific Exploration, 4:153–169, 1990.
[Johnson, 1970] Richard A. Johnson. Asymptotic expansions associated with posterior distributions. Annals of Mathematical Statistics, 41:851–864, 1970.
[Kalbfleisch, 1986] John D. Kalbfleisch. Pseudo-likelihood. In Samuel Kotz, Norman L. Johnson, and Campbell P. Read, editors, Encyclopedia of Statistical Sciences, volume 7, pages 324–327. Wiley-Interscience, New York, 1986.
[O’Hagan, 1995] Anthony O’Hagan. Fractional Bayes factors for model comparison (with Discussion). Journal of the Royal Statistical Society, Series B, 57:99–138, 1995.
[Pauler, 1998] Donna K. Pauler. The Schwarz criterion and related methods for normal linear models. Biometrika, 85:13–27, 1998.
[Spiegelhalter and Smith, 1982] David J. Spiegelhalter and Adrian F. M. Smith. Bayes factors for linear and log-linear models with vague prior information. Journal of the Royal Statistical Society, Series B, 44:377–387, 1982.
[Tierney and Kadane, 1986] Luke Tierney and Joseph B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82–86, 1986.
[Wald, 1949] Abraham Wald. Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20:595–610, 1949.
[Wasserman, 2000] Larry Alan Wasserman. Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44:92–107, 2000.
A APPENDIX: PROOF OF THEOREM 1
Proof. We can write

(A.1) \log \frac{p^n(X^n)}{q^n(X^n)} = \log \frac{p^n(X^n)}{p^n(X^n \mid \hat\theta_n)} + \log \frac{p^n(X^n \mid \hat\theta_n)}{p^n(X^n \mid \theta^*)} + \log \frac{p^n(X^n \mid \theta^*)}{q^n(X^n)} = A + B + C, \quad \text{say,}
where \hat\theta_n denotes the maximum likelihood estimator of θ in model M based on data X^n. Henceforth we drop n from the notation. Note that the prior distribution only enters into the term A. The asymptotic form of the term A, based on Laplace expansion, is well-known [Tierney and Kadane, 1986]. We have:

(A.2) A = \frac{1}{2}\, d \log 2\pi + \log \frac{p(\hat\theta)}{\{\det -l''(\hat\theta)\}^{1/2}} + O(n^{-1}).
This expansion is essentially algebraic, rather than probabilistic, and will typically hold for essentially all data sequences, whether or not they appear to be generated by some distribution in the model M; in particular, we can regard the final term as O_p(n^{-1}) under Q. Under Q, \hat\theta \to \theta^* almost surely [Wald, 1949]. Taylor expansion in the neighbourhood of \hat\theta yields:

(A.3) l'(\theta) \approx l''(\hat\theta)(\theta - \hat\theta),

so that

(A.4) \hat\theta - \theta \approx -l''(\hat\theta)^{-1}\, l'(\theta).

Under Q, each of l'(θ), l''(θ) is a sum of n independent and identically distributed components; moreover, it may be checked that E_Q\{l'(\theta^*)\} = 0. It follows that

(A.5) n^{1/2}(\hat\theta - \theta^*) \xrightarrow{L} N\{0,\, (J_2^{*\,-1})^T J_1^* J_2^{*\,-1}\},
where J_1^* = J_1(\theta^*), J_2^* = J_2(\theta^*), with J_1(\theta) = \mathrm{var}_Q\{l_1'(\theta)\} and J_2(\theta) = E_Q\{-l_1''(\theta)\}. In particular, under Q, \hat\theta = \theta^* + O_p(n^{-1/2}) and -l''(\hat\theta) = nJ_2^* + O_p(n^{1/2}). It then follows from (A.2) that

(A.6) A = -\frac{1}{2}\, d \log \frac{n}{2\pi} + \log \frac{p(\theta^*)}{\{\det J_2^*\}^{1/2}} + O_p(n^{-1/2}).

Note in particular that the asymptotic effect of the within-model prior density p(θ) on \log\{p^n(X^n)/q^n(X^n)\} is entirely captured in its order-1 contribution to (A.6). From this point on we consider the two cases (i) and (ii) individually.

(i) If Q \not\in M, then K(Q, M) > 0. In this case, under Q, \log p(X \mid \theta^*)/q(X) is a sum of n independent and identically distributed components with mean -K(Q, M), and thus term C in (A.1) has the form

(A.7) C = -nK(Q, M) + n^{1/2}\, \mathrm{var}_Q\!\left\{\log \frac{p^1(X \mid \theta^*)}{q^1(X)}\right\}^{1/2} Z,
where Z \xrightarrow{L} N(0, 1). Equation (A.6) shows that A = O_p(\log n). Finally, consider B = l(\hat\theta) - l(\theta^*). By Taylor expansion,

(A.8) B \approx \tfrac{1}{2}\{n^{1/2}(\hat\theta - \theta^*)\}^T \{-n^{-1} l''(\hat\theta)\} \{n^{1/2}(\hat\theta - \theta^*)\} \approx \tfrac{1}{2}\{n^{1/2}(\hat\theta - \theta^*)\}^T \{J_2^*\} \{n^{1/2}(\hat\theta - \theta^*)\},
which, by (A.5), is O_p(1). Thus both terms A and B are of smaller order than C, and (4) follows.

(ii) Now suppose Q \in M, so that K(Q, M) = 0 and Q = P_{\theta^*}. In this case, term C in (A.1) vanishes. We also now have J_1^* = J_2^* = I(\theta^*). Thus

(A.9) A = -\frac{1}{2}\, d \log \frac{n}{2\pi} + \log \rho(\theta^*) + O_p(n^{-1/2}).
Also by (A.8) and (A.5), or directly from Wilks’ Theorem,

(A.10) B \xrightarrow{L} \tfrac{1}{2}\chi^2_d,
and in fact the distribution of B will typically differ from that of \tfrac{1}{2}\chi^2_d by O(n^{-1/2}). Then (5) follows. (For a more rigorous treatment, see Clarke and Barron [1990].)
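As an added numerical check of the Laplace expansion (A.2), the following Python sketch uses a conjugate Bernoulli example in which both sides are available in closed form; the Beta(2,2) within-model prior and the data values are hypothetical choices made only for this check.

from math import lgamma, log, pi

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# X_i ~ Bernoulli(θ), prior density p(θ) = θ(1-θ)/B(2,2) on (0,1).
n, s = 1000, 600
th = s / n                                   # maximum likelihood estimate

# Left side: A = log p(X^n) - log p(X^n | θ-hat), via the exact Beta integral.
A = (log_beta(s + 2, n - s + 2) - log_beta(2, 2)) \
    - (s * log(th) + (n - s) * log(1 - th))

# Right side of (A.2) with d = 1, where the observed information is
# -l''(θ-hat) = s/θ-hat² + (n-s)/(1-θ-hat)² = n / {θ-hat (1-θ-hat)}.
log_prior_at_th = log(th) + log(1 - th) - log_beta(2, 2)
rhs = 0.5 * log(2 * pi) + log_prior_at_th - 0.5 * log(n / (th * (1 - th)))

print(A, rhs, A - rhs)                       # difference is O(1/n)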
Part VI
Attempts to Understand Different Aspects of “Randomness”
DEFINING RANDOMNESS

Deborah Bennett

We all have a commonsense understanding of randomness. Ordinary, everyday definitions of randomness encompass notions of lack of order, periodicity, pattern, aim, or purpose. When we reach for simple examples of a random occurrence, we usually think of a coin toss, a dice throw, or a lottery. If your chance of tossing a head on a coin toss is the same as mine in order to decide who gets to go first or who gets first choice, no one has an unfair advantage. If you have a fair chance of winning a lottery, then you are satisfied that the choice of the winner is random. There is something democratic about randomness. The inability to know in advance what is coming prevents us and, more importantly, prevents others from manipulating or controlling events.

The inability of mere mortals to manipulate the outcome is undoubtedly what led early societies to employ the use of random devices to reveal the gods’ decisions in divination. From ancient Babylonian and Hebrew societies, through ancient Greek and Roman uses, the lot revealed divine intentions. The lot may or may not have been controlled by a deity, but the outcome of the lottery was beyond the control of other humans. Its use was somehow fair. Not only were random devices used in divination, but they have also been used in games since ancient times. Practically every society, modern as well as ancient, uses some type of randomizers in games. Today’s six-sided cubical dice have been around for over three thousand years, but much earlier dice were two-sided and four-sided. Again, the notion of fairness arises, as the introduction of chance into a game seems to level the playing field between unmatched brains or brawn.

While we all have an understanding of the concept of randomness, pinning down a satisfactory definition can prove to be troublesome. What we may agree on is that randomness encompasses the concepts of fairness and unpredictability. In science, an event is described as random if it may or may not occur. It may be more likely to occur, less likely to occur, or equally likely to occur as some other random event. Of course, if we know that an event will occur, it is certain and if we know that the event will not occur, it is impossible. But a random or chance event defies precise predictability. Some would prefer randomness to remain undefined — just as the concept of a point or a line remains undefined in geometry. But for others, the definition of randomness is essential since it affects our concept of probability. Randomness is the foundation on which the calculus of probability is based. It may seem odd, but the unpredictability of the random event allows prediction at the macro-level by way of the laws of chance. The unruliness of chance allows laws of probability.
Although there are similarities, randomness is not the same as chaos and it is not anarchy. Nicholas Rescher describes their shared characteristics, saying that “rational forecasting of outcomes is simply infeasible.” In the case of chaos, processes are so sensitive to minute changes in conditions that determination of their instability is beyond our power. In the case of anarchy, there are no laws, only mayhem.

RANDOMNESS FINALLY DEFINED

I suppose no one felt a need to define randomness until probability became a science. Games of chance that utilized random devices or randomizers had existed since ancient times, but it took many centuries before chances were quantified and probabilities were studied. The calculus of probability arose from the analysis of games of chance, appearing first in 1564 when Girolamo Cardano wrote a book analyzing various such games. Cardano’s book was not published until almost a hundred years later, but between 1613 and 1623 Galileo Galilei wrote a small tract, entitled Thoughts about Dice Games. In the mid-17th century, serious discussion of probability began with the correspondence between Blaise Pascal and Pierre de Fermat over a gambling question from the Chevalier de Méré. Despite the fact that the Pascal/Fermat correspondence originated from a seemingly trivial gaming problem, the consideration they gave the subject was noticed by other mathematicians and philosophers of the time. The discussions of these distinguished mathematicians caught the attention of Christiaan Huygens, who wrote the first printed text on the calculus of probability in 1657. Even then, no one thought it necessary to define probability, much less randomness. These probability pioneers were followed by eighteenth-century greats, such as Abraham de Moivre, whose The Doctrine of Chances saw multiple editions from 1711 through 1756. The almost three-hundred-year development of the theory of probability culminated in the 19th century with Pierre-Simon Laplace, who was first to recognize the need for a precise definition of probability. Laplace’s work represented what came to be known as the classical interpretation of probability.

The nineteenth century enjoyed an explosion of the applications of probability and the initial stages of its use in statistical inference. Scientists became increasingly aware that analogies existed between games of chance and random phenomena in biology and the physical and social sciences. The twentieth century produced even more applications of probability — in physics, economics, insurance, and telecommunications. Classical interpretations of probability gave way to the frequency interpretation of probability: The probability of an event is its relative frequency over a long, long time. Frequentists defined probability in terms of the relative frequency of objective properties that they could measure and sample. Leading the way for this point of view were Jerzy Neyman and Egon Pearson, who were influenced by John Venn, R. A. Fisher, and Richard von Mises. As they drew random samples for experiments they were performing in order to attain statistical estimates and perform statistical
tests, the frequentists perceived a need to define randomness. Definitions of randomness began to be established in the 20th century when Richard von Mises attempted to provide a definition of a random sequence that was both mathematically and intuitively satisfying to the frequency view of probability. According to von Mises, a sequence is random if it is impossible to predict where in the sequence a particular observation will occur. Von Mises’s definition is often referred to as the impossibility of a gambling system. If it is impossible for a gambler to invent a system of prediction that will change his long-run odds of winning, then the gambler is playing a game of pure chance. No system and no formula can specify the gambler’s outcomes ahead of time, thus enabling him to adjust his bets in anticipation of a particular outcome. If we are unable to devise a system for predicting a particular observation without prior knowledge of the sequence, then we are dealing with a random sequence. The intuitively appealing part of von Mises’s definition is that it eliminates systematic, patterned sequences since they would be predictable.

The most popular opposing theory to that of the frequentists is that of the subjectivists (or Bayesians) who espouse epistemic probabilities. Subjective probabilities are judgments, perhaps informed judgments. Subjective probabilists express degrees of belief in uncertain propositions. Personal or subjective evaluations of probability are used particularly when events have not occurred often enough to provide long term evidence. Before you object that odds can’t be subjective, think about how the odds are set in a horse race. The degree of belief by the bettors influences their bets and the odds for a horse are based on the distribution of bets by those subjective believers. For the subjectivists, their definition of randomness is founded in their principle of indifference or principle of insufficient reason.

Von Mises’s view defines randomness in terms of the inability to reliably predict over the long run. In the toss of a coin, the outcome is random and the sides are equally likely because the relative frequencies over the long, long run have demonstrated that this is so. For the subjectivist, the toss of a coin is random because we have no reason to believe that one side is more likely to appear than the other, thus forcing us to be indifferent to the choice. Even experts disagree about the many meanings of randomness. Do you think that the throw of a die is random with the sides equally likely to come up because centuries of dice-throwing has shown us that each side comes up one-sixth of the time? Or, do you base your thinking on the physical uncertainty of the situation? The die appears to be constructed in such a way that any side can come up equally often and, in the absence of further information, you have no sufficient reason to judge otherwise.

As the science of probability and statistics gained widespread application in the twentieth century, inferences needed to be based on random samples, and sources for random numbers began to be explored. A new definition of a random sequence evolved — an algorithmic definition. Definitions of complexity in information theory led Andrei Kolmogorov and Gregory Chaitin to define a sequence as random by virtue of the length of the program, algorithm, or rule that is necessary to describe
it. The more compact the rule, the less random the sequence. The longer the rule, the more random the sequence. The ideal random sequence is one that cannot be described by a rule shorter than the length of the sequence itself (any sequence can be described by naming the digits of the sequence one by one).

INADEQUACIES AND DEFICIENCIES

One argument against the long run frequency view of probability has to do with its definition of randomness. The impossibility of a prediction system implies that no rule or formula exists to predict a random sequence. The insistence that “no rule exists” seems a rather stringent requirement. Perhaps a rule exists but if it is unknown to everyone, then a prediction system may prove impossible nevertheless.

Statistical experimentation most often encountered randomness in connection with random samples. Random samples were selected according to some probabilistic criteria and often, in order to select members of a sample randomly, random number generators were needed. A random number generator might involve a simple procedure such as drawing numbered cards (after mixing them) or rolling a many-sided die. On the more sophisticated side, a random number generator might be a fancy computer algorithm. Wait a minute . . . , you say. A computer program written by human beings must produce deterministic output. If this is so, how can a mathematical formula create random digits?

Non-manmade sources for random digits have also been considered. The digits of irrational numbers like π have been examined for randomness. These nonrepeating, never-ending decimal numbers have the sense of the unexpected, but they are still determined constants. I may not know the 342nd digit of the decimal expansion of π, but I can assure you that someone does.

Is randomness caused by the process or the outcome? A computer random number generator is not a random process but it may output a sequence that appears random. Is something random by virtue of its disorderly appearance or by virtue of the mechanism or device that created the disorderly appearance? Indeed, a random process does not even guarantee a disorderly appearance. Suppose we wish to generate a random sequence of 0s and 1s by using some lottery or coin-tossing method. Every possible sequence has an equal chance of being the sequence that we create. Therefore, the sequences of all 0s or all 1s or alternating 0s and 1s have an equal chance of being created. Should one of these occur as our random sequence, we might want to throw it out as not looking random enough. The process was random enough, but the product, the sequence we produced, was quite orderly and patterned.

If a sequence is random by virtue of the device that created it, then do we toss it out if it doesn’t appear “random enough”? When we are faced with a supposed random sequence, must we question how it was generated or should we judge its randomness based on how it looks? Some experts have said that it matters not how the sequence was generated, what is important is its disorderliness. If a sequence is random by virtue of its disorderliness
then random sequences must be able to pass objective tests of disorderliness. Numerous measures of orderliness (or disorderliness) have been developed and have become tests for randomness. But finding a sequence that passes all tests faces the same difficulties as finding a sequence that passes all systems of prediction (the impossibility of a gambling system). We must then decide: If we can’t require our sequence to pass all tests of randomness, then what percent of the tests should our sequence pass? Which tests of randomness should our sequence pass? And, what level of significance shall we set for the tests?

The definition of a random sequence emanating from the realm of informational complexity is based on both the process and the output. If a sequence output is complex enough to require a lengthy description, it is a random sequence. For example, a 100-digit sequence of 0s and 1s that requires a simple (or short) description is not very random. Take the examples mentioned earlier. The descriptions, “all 0s” or “all 1s” or “alternating 0s and 1s,” are concise descriptions and therefore do not describe sequences complex enough to be defined as being random. This is intuitively pleasing since they do not feel very random; they hold no sense of uncertainty for us. The longest possible description of a sequence of 0s and 1s is as long as the sequence itself. In order for the maximally complex sequence to be described, each digit must be named one by one. For a computer-generated sequence, this “description of a sequence” equates to the computer program or algorithm used to produce the sequence, and program lengths can be measured. However, to create random sequences we need long programs and long computer programs are neither efficient nor practical. On the other hand, measuring the complexity of sequences does provide us with a scale against which the randomness of a sequence can be measured along a continuum. In other words, we can measure degrees of randomness.

COMMONALITIES

None of the definitions of randomness are free from contradiction and none of them seem completely satisfactory. They do, however, share some commonalities. They all encompass our uncertainty, and they all hold perfect randomness to be an absolute. The frequency view of randomness is based on our inability to know what is coming in the long run; perfect randomness is found when no system exists that would allow us to predict what is coming. The subjective view of randomness is founded on our ignorance of conditions or the unknowability of outcomes, and this uncertainty guarantees insufficient understanding to predict accurately. Randomness based on the length of the description necessary to identify a given sequence provides us with a way to measure the degree of disorder and defines the sequence that is maximally complex.

If perfect randomness is an absolute, we may be forced to accept the reality that an absolute is often unattainable. Pure chance, the ideal, may be impossible to achieve. The chance outcomes we are so familiar with may not be situations involving pure chance; they may, in fact, just be chancy. If perfect randomness is
beyond our reach, we may be forced to accept relative randomness. Although the ideal random sequence is maximally complex, a relatively random sequence may be “complex enough.” Although a sequence may not pass all possible statistical tests for randomness, we may set a threshold level of how many and which tests must be passed in order to declare a sequence “random enough.” We may even accept different threshold levels for different purposes. The thresholds we set for randomly choosing components of a nuclear reactor to test might differ from those we set to randomly choose a door prize at a charity event. “Relative” randomness would permit degrees of randomness depending on the situation.

We began by suggesting that randomness may be difficult to define. Even relative randomness may defy definition. However, we can pinpoint qualities that we require. We require that relative randomness fulfill our sense of fairness and equity over the long run. Like absolute randomness, relative randomness may seem fickle and unfair in the short run, but it must appear predictable in the long run. We will accept local patterns but not global ones. We require that relative randomness satisfy our notion of uncertainty and unpredictability. It should be not only impractical for us to predict the relatively random outcome, it should be impossible for us to make rational predictions. We require that relative randomness measure up to ideal randomness as much as is humanly possible. As we reach for perfection, we will accept only the near perfect.

ACKNOWLEDGMENTS

I would like to acknowledge the anonymous referee who offered suggestions for this paper.

BIBLIOGRAPHY

[Bennett, 1998] D. J. Bennett. Randomness. Cambridge, MA: Harvard University Press, 1998.
[Chaitin, 2001] G. Chaitin. Exploring Randomness. London: Springer-Verlag, 2001.
[Chaitin, 1975] G. Chaitin. Randomness and mathematical proof. Scientific American 232 (5): 47–52, 1975.
[Church, 1940] A. Church. On the concept of a random sequence. American Mathematical Society Bulletin 46: 130–135, 1940.
[David, 1988] F. N. David. Games, Gods & Gambling: A History of Probability and Statistical Ideas. Dover Publications, 1988.
[Eagle, 2005] A. Eagle. Randomness is unpredictability. British Journal for the Philosophy of Science 56: 749–790, 2005.
[Fisher, 1926] R. A. Fisher. On the random sequence. Quarterly Journal of the Royal Meteorological Society 52: 250, 1926.
[Gardner, 1977] M. Gardner. Random numbers. In Mathematical Carnival. New York: Vintage Books, 1977.
[Hacking, 1965] I. Hacking. Logic of Statistical Inference. Cambridge University Press, 1965.
[Humphreys, 1976] P. W. Humphreys. Inquiries in the philosophy of probability: Randomness and independence. Ph.D. diss., Stanford University, 1976.
[Knuth, 1997] D. E. Knuth. The Art of Computer Programming. Vol. 2: Seminumerical Algorithms, 3rd ed. Reading, MA: Addison-Wesley, 1997.
[Kolmogorov, 1963] A. Kolmogorov. On tables of random numbers. Sankhyā, Series A 25: 369–376, 1963.
[Kolmogorov, 1968] A. Kolmogorov. Three approaches for defining the concept of information quantity. Problems of Information Transmission 1: 4–7, 1968.
[Martin-Löf, 1969] P. Martin-Löf. The literature on von Mises’ Kollektivs revisited. Theoria 35: 12–37, 1969.
[Mises, 1939] R. von Mises. Probability, Statistics and Truth. 2nd ed. Trans. J. Neyman, D. Sholl, and E. Rabinowitsch. New York: Macmillan, 1939.
[Ore, 1953] O. Ore. Cardano, the Gambling Scholar. Princeton University Press, 1953.
[Porter, 1988] T. M. Porter. The Rise of Statistical Thinking 1820–1900. Princeton University Press, 1988.
[Rescher, 1995] N. Rescher. Luck: The Brilliant Randomness of Everyday Life. New York: Farrar Straus Giroux, 1995.
[Stigler, 1990] S. Stigler. The History of Statistics: The Measurement of Uncertainty before 1900. Belknap Press, 1990.
MATHEMATICAL FOUNDATIONS OF RANDOMNESS

Abhijit Dasgupta

This article is dedicated to the centenary of the Borel Strong Law.
1 INTRODUCTION

1.1 A random blackbox?
Imagine a “blackbox” which supposedly produces its outcomes “randomly” according to some fixed finite probability distribution.

BLACKBOX → Outcome

Thus there are a finite number of possible outcomes, say ω_1, ω_2, . . . , ω_n, and each outcome ω_i has a fixed non-trivial probability p_i, so that P(ω_i) = p_i with 0 < p_i < 1 for i = 1, 2, . . . , n, and \sum_{i=1}^{n} p_i = 1. The outcomes may be generated either automatically from a continuously running process, or on demand, say by pressing a button on the blackbox. We think of this as an abstract model representing a process, a machine, or an experiment. Familiar examples are the flip of a coin, the turn of a casino gambling wheel, the roll of an “electronic die” on a handheld video game device, the snapshot of weather data, the time between two successive clicks of a Geiger counter detecting radioactive decay, the stock market index value, etc.

Here, we approach the problem in a purely extensional way. This means that, except for the given knowledge of the probability values for each possible outcome, we are only able to observe the outcomes, and do not have any access to, or information about, the internal workings of the machine. Hence the term blackbox. Gambling houses and forecasters of weather and stock market as well as philosophers of probability and statistics have found the following question to be of considerable interest.

Question A. Is the blackbox a random device? Does it produce its outcomes randomly (while obeying a fixed probability distribution)?
1.2 Sequences
We cannot hope to answer the last question by observing a single outcome of the blackbox. In fact, no finite amount of observation of outcomes can fully confirm that a process is random (or not). On the other hand, by repeating the process a sufficiently large number of times and observing the resulting sequence of outcomes, we may hope to gain enough information for answering the question with a desired level of confidence. (Such a sequence is denoted by ⟨x_1, x_2, . . . , x_k, . . .⟩, where each x_k, the k-th term of the sequence, equals one of the outcome values ω_1, . . . , ω_n.)

Given a sequence of outcomes of the blackbox, we want to determine whether or not it was produced randomly. Our extensional approach means that other than a knowledge of the underlying probability distribution for the possible outcome values, we only have the completed sequence of outcomes available, with all information about its origination permanently removed, so the answer must depend solely on the sequence itself and not on how it was produced. This necessitates the consideration of a different but more precise question:

Question B. When is a given sequence of outcomes random (relative to a fixed probability distribution for the outcome values)?

This article deals entirely with this last question (Question B, randomness of sequences), and not with the earlier question (Question A, randomness of processes). It is one of the fundamental questions in the philosophy of probability and statistics. The longer the sequence, the more information we have for determining if it is random or not. And in the ideal case, the sequences will be infinite sequences. We will consider both cases — finite and infinite sequences — of the question in detail. It was von Mises who first treated this question rigorously (for infinite sequences), and considered its answer to be the very foundation of probability, known as the frequentist interpretation of probability.
1.3 Pseudo-randomness: A Galilean Dialogue
In his celebrated work Gödel, Escher, Bach, Douglas Hofstadter quotes the following “beautiful and memorable passage” from Are Quanta Real? — a Galilean Dialogue by J. M. Jauch [Jauch, 1990]:

Salviati: Suppose I give you two sequences of numbers, such as

78539816339744830961566084 . . .

and

1, −1/3, +1/5, −1/7, +1/9, −1/11, +1/13, −1/15, . . .

If I asked you, Simplicio, what the next number of the first sequence is, what would you say?
Simplicio: I could not tell you. I think it is a random sequence and that there is no law in it.

Salviati: And for the second sequence?

Simplicio: That would be easy. It must be +1/17.

Salviati: Right. But what would you say if I told you that the first sequence is also constructed by a law and this law is in fact identical with the one you have just discovered for the second sequence?

Simplicio: This does not seem probable to me.

Salviati: But it is indeed so, since the first sequence is simply the beginning of the decimal fraction [expansion] of the sum of the second. Its value is π/4.

Simplicio: You are full of such mathematical tricks . . .
The dialogue illustrates one aspect (among many) of the problem of defining randomness for sequences. An apparently random sequence of digits may really be pseudo-random: while it may appear to be “statistically random” and unpredictable, there may be a (hidden) rule or arithmetical method for generating the entire infinite sequence purely deterministically. Our intuition tells us that if there is a deterministic and effective rule for computing every term of an infinite sequence, then, despite appearances, the sequence cannot be genuinely random, as von Neumann famously cautioned. But how do we precisely define the class of infinite sequences that are “genuinely random”?1
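Salviati’s claim can be checked by direct computation. A minimal sketch, assuming the mpmath arbitrary-precision library is available:

```python
from mpmath import mp

mp.dps = 30          # work with 30 significant decimal digits
print(mp.pi / 4)     # 0.785398163397448309615660845820
# The dialogue's "random-looking" first sequence, 78539816339744830961566084...,
# is exactly the start of this decimal expansion; the second sequence is the
# Leibniz series 1 - 1/3 + 1/5 - ..., whose sum is pi/4.
```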
1.4 A Laplacian problem
Imagine our blackbox to be a computer program simulating the flip of a fair coin, with 1 denoting heads and 0 denoting tails. Suppose that we run it to generate fifty flips and observe the resulting outcome sequence. Such an outcome sequence is a binary string of length 50, and there are 2^50 possible outcome sequences. If we observe the outcome sequence to be

10010001001100111011001010100000110100001100011100,

we are not surprised. We regard it as a “random binary string”, and consider the program to be running normally. But if the outcome sequence is

01010101010101010101010101010101010101010101010101,

we consider it to be an extraordinary event, and justifiably suspect malfunction, perhaps a bug in the program.
1. Note that it would be an error to define random infinite sequences as those which cannot be generated deterministically by a specified rule or arithmetical method. This is because there are only countably many such methods, while there are uncountably many sequences which are not random in any sense of the word. For example, there are uncountably many infinite binary sequences for which the bits at even positions are all set to 0 but the bits at odd positions are allowed to be arbitrary, and none of these sequences is random.
But according to simple probability, this binary string of perfectly alternating 0s and 1s is as likely to be produced as the first, supposedly “random-looking” string, and so should not be regarded as any more special than the first one. Yet, in a clear intuitive sense, the “regularity” of the second string makes it much less random than the first.

P. S. Laplace (1749–1827) was aware of this problem and pointed out the following reason why, intuitively, a regular outcome of a random event is unlikely:

“We arrange in our thought, all possible events in various classes; and we regard as extraordinary those classes which include a very small number. In the game of heads and tails, if heads comes up a hundred times in a row, then this appears to us extraordinary, because the almost infinite number of combinations that can arise in a hundred throws are divided in regular sequences, or those in which we observe a rule that is easy to grasp, and in irregular sequences, that are incomparably more numerous.” [de Laplace, 1819/1952]

In addition to the rarity of regular patterns, Laplace points out that certain strings may have a “cause”, making them unlikely to be random:

“The regular combinations occur more rarely only because they are less numerous. If we seek a cause whenever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or that of chance, the first of these suppositions, is more probable than the second. On a table, we see letters arranged in this order: C o n s t a n t i n o p l e, and we judge that this arrangement is not the result of chance, not because it is less possible than others, for if this word were not employed in any language we would not suspect it came from any particular cause, but this word being in use among us, it is incomparably more probable that some person has thus arranged the aforesaid letters than this arrangement is due to chance.” [de Laplace, 1819/1952]

We therefore observe (still using our vague and imprecise language) that among all finite strings of a fixed large size (say length 100), there are only a relatively small number of “non-random” strings — strings which possess “regularity” or, more generally, have some “cause” behind them. We may call the other strings random. Thus, it appears that for finite strings of a fixed large size (say length 100), there is some attribute that corresponds to the intuitive notion of randomness. But how do we precisely define it?

This intuitive attribute of randomness can be partially approximated by formulating certain events described in the ordinary language of classical probability. E.g., by requiring that the standard deviation of the run lengths of digits be within certain limits, one can exclude such regular sequences as the one above with alternating 0s and 1s. In fact, such events are designed and formulated as statistical tests
for estimating the “randomness confidence” of a finite sequence of digits. Some examples are restrictions on the distribution of run-lengths, autocorrelation, serial correlation, comparison with standard test distributions such as the χ²-test, etc.2 However, all such tests appear to be only partial approximations, and no event formulated in the usual language of classical probability seems to capture this attribute of randomness in an intuitively satisfactory way.
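To give the flavor of such tests, here is a toy Python sketch of two statistics — a χ² value for bit frequencies and a count of runs. It illustrates the general idea only and is not any particular test from Knuth [1998]:

```python
def chi_squared_bits(s):
    """Chi-squared statistic for 0/1 frequencies under the fair coin model."""
    n, ones = len(s), s.count("1")
    expected = n / 2
    return ((ones - expected) ** 2 + ((n - ones) - expected) ** 2) / expected

def count_runs(s):
    """Number of maximal blocks of equal bits; about (n + 1)/2 is expected."""
    return 1 + sum(s[i] != s[i - 1] for i in range(1, len(s)))

looks_random = "10010001001100111011001010100000110100001100011100"
alternating  = "01010101010101010101010101010101010101010101010101"

for s in (looks_random, alternating):
    print(f"chi2 = {chi_squared_bits(s):5.2f}   runs = {count_runs(s)}")

# Both strings pass the frequency test (small chi-squared), but the alternating
# string has 50 runs where about 25 are expected: the run statistic exposes it.
```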
1.5 The main problems. What this article is about
The considerations above lead us to two classic problems of mathematical and statistical philosophy:

PROBLEM 1 (Randomness for Infinite Sequences). When is a given infinite sequence of digits random?
PROBLEM 2 (Randomness for Finite Strings). When is a given finite string of digits random?

As it turned out, most of the early mathematical results concerning randomness were about infinite sequences.3 Work on defining randomness for finite sequences started later.4 As Ulam had noted [MacHale, 1993]: “The infinite we shall do right away. The finite may take a little longer.” It also turned out that the two concepts are closely connected in a remarkable way, with “algorithm” or “effective computability” playing a central role in their definitions. Quite satisfactory answers to both problems emerged almost simultaneously in the mid-1960s, together with the birth of the new fields of Algorithmic Randomness and Algorithmic Complexity.

The primary goal of this article is to present these celebrated solutions of the above two classic philosophical problems as a brief introduction to algorithmic randomness and complexity. It is, however, important to note that a vigorous effort to further sharpen, refine, and calibrate the definition of randomness is currently being pursued through a large and growing body of lively research activity in algorithmic randomness [Nies, 2009; Downey and Hirschfeldt, 2010; Li and Vitanyi, 2008; Shen et al., 20??]. Therefore the topic is best viewed as a part of continuing research, and there is still ample scope for debate over whether the answers obtained in the 1960s are final.

2. See Knuth [1998] for such statistical tests of randomness. Chris Wetzel of Rhodes College has interactive web pages illustrating such tests of randomness (http://faculty.rhodes.edu/wetzel/random/intro.html).
3. Borel’s 1909 work on normal numbers and the Weyl equidistribution theorem of 1916 are preliminary forms of randomness for infinite sequences, but it was von Mises who in 1919 directly focused on the concept and gave a fundamental definition. Church’s introduction of “algorithm” in 1940 made von Mises’ definition mathematically precise and also made the subject permanently “algorithmic”. For a truly satisfactory answer, one had to wait another quarter century until Martin-Löf found it in 1965.
4. Early work was done in the 1960s by Solomonoff [1960; 1964], Kolmogorov [1963; 1965], and then Chaitin (see [Chaitin, 1992] for references).
Throughout this article, we will consider only strings and sequences, and make the further simplifying assumption that the only digits (possible outcomes) are 0 and 1, with equal probability of 1/2 each — the “fair coin model”. In other words, the blackbox always represents the flip of a perfectly fair coin. In particular, all strings and sequences will be binary strings and sequences. This restriction is not really as stringent as it may appear to be (see subsection 2.5).
1.6 The solutions. Algorithmic randomness
There are three key approaches for defining randomness in sequences: unpredictability, typicality, and incompressibility [Downey and Hirschfeldt, 2010]. We now briefly and informally introduce these approaches. They will be described in detail using mathematically precise language later in this article.

• Unpredictability. This can be understood in terms of the impossibility of successful gambling strategies. According to randomness as unpredictability, an infinite binary sequence is random if, roughly speaking, it is impossible to effectively specify a gambling strategy which can make long-run gains for the gambler when played against this sequence as the outcomes. A frequentist version of randomness as unpredictability was used by von Mises; see the comments of Feller quoted in subsection 5.1.

• Typicality. A property or attribute of infinite binary sequences is called special if the probability that the property holds is zero, and is called typical if the probability that the property holds is one. An attribute is special if its complement (negation) is typical, and vice versa. For example, consider the property of having no run of zeros of length seven. It can be shown that the probability is zero that an infinite binary sequence has no run of zeros of length seven, thus this property is special. The intuition here is that if a sequence has a special property or attribute, then it cannot be random, and randomness is equivalent to the complete lack of special attributes. According to randomness as typicality, an infinite binary sequence is random if, roughly speaking, it is impossible to effectively specify a special property that the sequence possesses. This is essentially the definition of Martin-Löf randomness. What Martin-Löf did was to provide a mathematically precise notion of effective specifiability in this context.

• Incompressibility. Some finite strings can be effectively specified by descriptions much shorter than the string itself. For example, the string

01010101010101010101010101010101010101010101010101

can be specified as “01 repeated fifty times”. The short description here exploits the regularity of pattern. The idea is also utilized by data compression programs: strings which compress well are the ones having regularity of pattern or redundancy of information, and highly irregular or “random” strings
do not compress well. This approach, known as Kolmogorov Complexity, was introduced by Solomonoff, Kolmogorov, and Chaitin, and yields a definition of randomness for finite strings: a string is random if its shortest description has length equal to the length of the string itself. This can also be carried over to infinite sequences: according to randomness as incompressibility, an infinite binary sequence is random if none of its initial segments compresses “very much”.

These three approaches to defining randomness may appear to be independent of each other, and we might expect them to lead to different definitions of randomness. It is therefore a remarkable fact that (with appropriate choices for the notion of effective specifiability) all three definitions turn out to be equivalent! Schnorr’s Theorem, a celebrated result, establishes the equivalence of randomness as typicality (i.e. Martin-Löf randomness) with randomness as incompressibility. This surprising equivalence between randomness as unpredictability, randomness as typicality, and randomness as incompressibility is strong evidence that the common equivalent definition is satisfactory, and that algorithmic randomness provides an adequate mathematical foundation for randomness.
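An everyday compressor gives a rough feel for the incompressibility idea, though it is only a crude stand-in: Kolmogorov complexity concerns shortest effective descriptions and is not computable. A Python sketch using the standard zlib module:

```python
import random
import zlib

regular   = "01" * 500                                         # "01 repeated 500 times"
irregular = "".join(random.choice("01") for _ in range(1000))  # coin-flip string

for label, s in (("regular", regular), ("irregular", irregular)):
    packed = zlib.compress(s.encode("ascii"), level=9)
    print(f"{label}: {len(s)} chars -> {len(packed)} bytes")

# The periodic string shrinks to a few bytes; the coin-flip string compresses
# far less (it still shrinks somewhat, since each ASCII character here carries
# only one bit of information).
```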
1.7 What this article is not about
Let us now clearly state a disclaimer. This article’s scope is restricted to Problems 1 and 2 above, dealing with randomness of sequences and strings from a purely extensional view. Our coverage of randomness is thus limited to what may perhaps be called “the randomness of bit patterns”, while the word randomness as used in ordinary language is overloaded with meanings, many of which do not necessarily involve bit patterns (binary strings and sequences).

Randomness may be discussed in a more general setting than just bit patterns. This has been done, e.g., by Eagle, who introduces the idea of randomness as maximal unpredictability and has written a critique of algorithmic randomness. One may also regard random and probabilistic processes as the main framework for investigating randomness. See Eagle [2005], where further references can be found. Also, there are popular books on randomness (e.g. Aczel [2004], Beltrami [1999], and Bennett [1998]) with accessible approaches to various aspects of randomness. The books by Taleb [2005] and by Mlodinow [2008] include various financial and social implications of randomness as well.

Even within the limited case of “bit-pattern randomness”, there are other important issues, such as the following broad question of considerable philosophical and practical interest: How can random sequences be physically generated? Another example is the relation between quantum mechanics and (algorithmic) randomness. Several interesting discussions and further references are in [Calude, 2005; Yurtsever, 2000; Svozil, 1993; Longo, 2009; Bailly and Longo, 2007; Penrose, 1989; Stewart, 2002].
Because of our extensional approach, we have omitted all such issues and limited ourselves only to mathematical definitions of randomness for sequences, i.e., precise criteria to distinguish random sequences from non-random ones. But even for mathematical definitions based on the extensional view, algorithmic randomness is not the only approach possible. For example, in [van Lambalgen, 1990], van Lambalgen introduces an axiomatic approach to randomness. We will not discuss such approaches either.

Other than reasons of space limitation and cohesion, there is another point which has compelled us to select only the topic of algorithmic randomness for discussion. This year marks the 100th anniversary of Borel’s strong law (1909), which first showed the importance of the concept of limiting frequency, and which is the precursor of many ideas based on it, including von Mises’ idea of randomness. We believe that in the one hundred years since Borel’s strong law, algorithmic randomness stands out as the crowning achievement in the study of random sequences. In addition to providing deep philosophical insight, it has unraveled fundamental connections of randomness with many diverse areas, connections that have been truly spectacular.5
1.8 How this article is organized
Section 2 sets up notation for strings and sequences and introduces the Cantor space, which forms the basic framework for carrying out further discussions; it also contains a brief and informal introduction to Lebesgue measure on the unit interval and the Cantor space, which we treat as equivalent. Discussion of mathematical randomness begins in Section 3, where we define a series of classical stochastic or frequency properties in an effort to distill out randomness, but conclude by noting how this forces upon us notions like definability and effective computability. In Section 4, an informal but precise definition of effective computability is given. Sections 5, 6, 7, 8, and 9 form the core material on algorithmic randomness: von Mises randomness, Martin-Löf randomness, Kolmogorov complexity and randomness of finite strings, application to Gödel incompleteness, and Schnorr’s theorem. Sections 10 and 11 form an overview of somewhat more advanced and recent topics, including definitions of various other related randomness notions that have been studied, and indicating the current state of affairs.

5. Algorithmic randomness is intimately related to the creation and development of several other fields such as Algorithmic Probability (originated by R. J. Solomonoff) and Universal Search (originated by L. A. Levin), all of which together comprise a broader (and vast) area now generally known as Algorithmic Information Theory. In addition to the creation of such fields, there has been wide and fundamental impact on many areas in mathematics, philosophy, statistics, and computer science: we mention recursion theory and Hausdorff dimensions; inductive and statistical inference and prediction; classical information theory; and complexity theory, machine learning, and artificial intelligence.
Finally, in Section 12 we express our view on the Martin-Löf–Chaitin thesis. Proofs of mathematical results are provided or outlined whenever they are fairly straightforward. Involved proofs are omitted; almost all of them can be found in one of the books [Feller, 1968; Knuth, 1998; Li and Vitanyi, 2008; Downey and Hirschfeldt, 2010; Nies, 2009; Chaitin, 1992; Calude, 1994; Odifreddi, 1992].
2 STRINGS, SEQUENCES, CANTOR SPACE, AND LEBESGUE MEASURE
We use the term “string” (if not further qualified) as a synonym for “finite sequence”, and the term “sequence” (if not further qualified) as a synonym for “infinite sequence”. To describe sequences and strings, we will use a finite alphabet Σ consisting of a finite number of letters or symbols. E.g., the decimal alphabet Σ = {0, 1, 2, …, 9} consists of ten digits, and the binary alphabet Σ = {0, 1} consists of just two bits. A string from Σ is simply a finite sequence of members of Σ. The empty string is denoted by Λ. The set of all strings (finite sequences) from Σ is denoted by Σ*. For a string σ ∈ Σ*, |σ| denotes the length of σ.

N denotes the set of all natural numbers. We deliberately leave it ambiguous whether zero is considered a natural number or not. It can always be understood from context, and this ambiguity does not cause any problem. The set of all infinite sequences from Σ will be denoted by Σ^N. For x ∈ Σ^N, we will often write x = ⟨x_n⟩_{n=1}^∞ = ⟨x(n)⟩_{n=1}^∞, where x_n = x(n) ∈ Σ for each n = 1, 2, ….
2.1 Binary strings and sequences. The Cantor space
To simplify the discussion, we will almost always limit ourselves to binary sequences and strings, i.e. the case where the alphabet is Σ = {0, 1}. The set {0, 1}* is the set of all binary strings. The set {0, 1}^N is the set of all infinite binary sequences, and is called the Cantor space.

If σ ∈ {0, 1}* is a finite binary string, we let N(σ) denote the set of all infinite binary sequences beginning with σ. E.g., N(101) consists of all infinite binary sequences ⟨x_n⟩ for which x_1 = x_3 = 1, x_2 = 0, and x_k is arbitrary for all k ≥ 4. The subsets of the Cantor space {0, 1}^N having the form N(σ) (for some binary string σ) will be called the basic intervals of the Cantor space. (By defining open sets as finite or infinite unions of basic intervals, the Cantor space becomes a topological space which is compact and metrizable.)
2.2 Lebesgue measure over the unit interval
The problem of Lebesgue measure over the unit interval [0, 1] = {x ∈ R : 0 ≤ x ≤ 1} is essentially a geometric one: Given a subset E ⊆ [0, 1], we want to assign it a number µ(E) which represents its “size” or “length”. For very simple subsets
such as an interval, its measure is simply its length: if J ⊆ [0, 1] is an interval with endpoints a ≤ b (i.e., J is one of the intervals (a, b), [a, b), (a, b], or [a, b]), the measure of J, denoted by µ(J), is defined as the length of the interval, µ(J) = b − a.

The next step is to define the length of any open subset of [0, 1]. A subset G of [0, 1] is said to be open if it is a union of open intervals. A standard fact about this linear continuum is that any open set can be expressed uniquely as a disjoint union of (possibly infinitely many) intervals. This allows us to naturally and uniquely define the measure µ(G) of an open set G ⊆ [0, 1] to be the sum (possibly as an infinite series) of the lengths of all the constituent disjoint intervals.

A key idea here is that a set of “small measure” can be covered by an open set of “small measure”: a set E is said to be measure-zero if E can be covered by open sets of arbitrarily small measure, i.e., for any ε > 0 there is an open set G containing E with µ(G) < ε. Slightly more constructively, E has measure-zero if there is an infinite sequence of open sets G_1, G_2, G_3, … with each G_n covering E and with µ(G_n) < 1/n.

A subset E ⊆ [0, 1] is defined to be (Lebesgue) measurable if for any ε > 0 there is an open set G containing E and an open set H containing the difference G \ E with µ(H) < ε. Thus a measurable set is one which can be approximated from outside by open sets arbitrarily closely. If E is measurable, it can be shown that the measure of the open set G above approaches a unique limit as ε → 0, and we denote this limit by µ(E). This defines the Lebesgue measure for every measurable subset of [0, 1]. If E ⊆ F ⊆ [0, 1] are measurable sets, then we have 0 ≤ µ(E) ≤ µ(F) ≤ 1.
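For instance — a standard worked example, filling in a step the text leaves implicit — any countable set E = {q_1, q_2, q_3, …} ⊆ [0, 1], such as the rationals in [0, 1], is measure-zero: given ε > 0, enclose the k-th point in an open interval of length ε/2^k:

```latex
\[
  E \subseteq G_\varepsilon
    = \bigcup_{k=1}^{\infty}
      \Bigl(q_k - \tfrac{\varepsilon}{2^{k+1}},\; q_k + \tfrac{\varepsilon}{2^{k+1}}\Bigr),
  \qquad
  \mu(G_\varepsilon) \le \sum_{k=1}^{\infty} \frac{\varepsilon}{2^{k}} = \varepsilon .
\]
```

Since ε > 0 was arbitrary, E can be covered by open sets of arbitrarily small measure, which is exactly the definition above.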
The class of measurable sets forms a vast collection. If E ⊆ [0, 1] is measurable, so is its complement [0, 1] \ E, with µ([0, 1] \ E) = 1 − µ(E). If ⟨E_n⟩ is a sequence of measurable sets, then their union ∪_n E_n and intersection ∩_n E_n are also measurable. If the sequence ⟨E_n⟩ consists of disjoint measurable sets, then the measure of their union is the sum of the measures of the individual sets: µ(∪_n E_n) = Σ_n µ(E_n).
2.3 Lebesgue measure on [0, 1] as probability
Consider the experiment of choosing a member x of the unit interval [0, 1] in such a fashion that for any two subintervals J_1, J_2 ⊆ [0, 1] of equal length, it is equally likely for x to be in J_1 as to be in J_2. (This is referred to as the uniform distribution over [0, 1].) For a subset E ⊆ [0, 1], the problem of determining the “probability that x is in E”, denoted by P(E), is identical to Lebesgue’s problem of finding the geometric measure (or length) of E. We therefore identify, for this experiment, the notion of “events” with the class of measurable sets, and the probability of an event E with the Lebesgue measure of E: P(E) = µ(E).
2.4 The Cantor space: Infinite sequence of flips of a fair coin
An infinite sequence of flips of a coin, with 1 representing heads and 0 representing tails, can be naturally represented by an infinite binary sequence, i.e., by a member of the Cantor space {0, 1}^N, the set of all possible outcomes in an infinite sequence of flips of a coin. Subsets of {0, 1}^N correspond to events; e.g., the basic interval N(101) represents the event that the first and third flips are heads and the second is tails. We stipulate that the coin is fair and the flips are independent by requiring that for each n ∈ N the 2^n possible outcomes in the first n flips are equally likely, or equivalently that the probability of the event N(σ) equals 1/2^{|σ|}.

Just as we extended Lebesgue measure on [0, 1] from intervals to open sets and then to arbitrary measurable sets, a similar process can be carried out to define a “probability measure” for the “measurable subsets” of {0, 1}^N, a process which we now briefly describe. The open sets in {0, 1}^N are defined to be arbitrary unions of basic intervals of the form N(σ). A basic interval N(σ) is said to be maximally contained in a set A if N(σ) ⊆ A but N(τ) ⊈ A for each proper initial segment τ of σ. Every open set G then decomposes uniquely into a disjoint union of component basic open intervals maximally contained in G. So we can now naturally and uniquely define the measure of G, denoted by µ(G), to be the sum of the lengths of these components.

A key idea is again that of “small sets”: if ε > 0 and we can cover a set E by an open set of measure less than ε, then we can expect the measure of E to be less than ε as well. A set E is said to have measure-zero if there is an infinite sequence G_1, G_2, G_3, … of open sets each containing E with µ(G_n) < 1/n for all n.6 Finally, as before in the case of the unit interval, define E ⊆ {0, 1}^N to be (Lebesgue) measurable if E can be approximated from outside by open sets arbitrarily closely, i.e., if for any ε > 0 there is an open set G containing E and an open set H containing the difference G \ E with µ(H) < ε.

It can then be shown that the measurable sets form a comprehensive collection including the open sets (and so all the basic intervals), and each measurable set E gets naturally assigned a unique measure µ(E) ∈ [0, 1]. Also, the complement of any measurable set E is measurable with µ({0, 1}^N \ E) = 1 − µ(E), and for any sequence ⟨E_n⟩ of measurable sets, their union ∪_n E_n and intersection ∩_n E_n are also measurable, with µ(∪_n E_n) = Σ_n µ(E_n) whenever the sequence ⟨E_n⟩ consists of pairwise disjoint sets. Also, if E ⊆ F ⊆ {0, 1}^N are measurable, then 0 ≤ µ(E) ≤ µ(F) ≤ 1.

Thus, starting with the simple method of assigning the probability (or measure) 1/2^{|σ|} to each basic interval N(σ), we are then able to naturally extend this assignment procedure to assign probabilities (or measure) to vastly more general types of infinite coin-flip events (measurable subsets of {0, 1}^N).

6. It is easy to see that a singleton is measure-zero. Using convergent infinite geometric series, it now follows that a countable union of measure-zero sets is also measure-zero. Thus all countable sets are measure-zero. A less trivial fact is that there are uncountable measure-zero sets.
We call this assignment (µ) the Lebesgue (or uniform) probability measure on the Cantor space {0, 1}^N.

In fact, by mapping an infinite binary sequence x = ⟨x(n)⟩_{n=1}^∞ to the real number Σ_n x(n)/2^n ∈ [0, 1], we can naturally identify the uniform probability measure on the Cantor space with the Lebesgue measure on [0, 1]: the interval [1/2, 1], e.g., corresponds to the event “the first flip is a heads”. This correspondence, as a mapping, is not quite one-to-one, since the dyadic rationals of the form m/2^n (where m, n are positive integers with 0 < m < 2^n) have two different binary expansions, but these form only a countable set of exceptional points. All other reals in [0, 1] have a unique infinite binary expansion. So this mapping between the Cantor space {0, 1}^N and the unit interval [0, 1] is an almost one-to-one correspondence satisfying µ(E) = P(E′) for any measurable subset E ⊆ [0, 1], with E′ being the set of infinite binary sequences which are binary expansions of the members of E. To summarize, the Lebesgue measure on [0, 1] (i.e., the uniformly distributed probability measure on [0, 1]) and the uniform probability measure on the Cantor space {0, 1}^N are essentially the same thing.

The zero-one law. This important result (used later) asserts the following for any measurable E ⊆ {0, 1}^N: suppose that whenever x, y ∈ {0, 1}^N are sequences differing only at a finite number of places (i.e., (∃m)(∀n > m)(x(n) = y(n))), we have x ∈ E ⟺ y ∈ E. Then E has measure either zero or one.
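A small Python sketch ties the two pictures together: it estimates µ(N(101)) by sampling prefixes, and applies the mapping x ↦ Σ_n x(n)/2^n to a finite prefix. The trial count and helper names are arbitrary choices of this sketch:

```python
import random

def to_real(bits):
    """Map a (prefix of a) binary sequence <x(1), x(2), ...> to sum_n x(n)/2^n."""
    return sum(b / 2 ** (n + 1) for n, b in enumerate(bits))

rng = random.Random(0)
trials, hits = 100_000, 0
for _ in range(trials):
    prefix = [rng.randint(0, 1) for _ in range(3)]
    if prefix == [1, 0, 1]:            # the sampled sequence lies in N(101)
        hits += 1
print(hits / trials)                   # close to mu(N(101)) = 1/2**3 = 0.125

print(to_real([1]))                    # 0.5: "first flip heads" <-> interval [1/2, 1]
```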
2.5 More general probability distributions
We stress the following: the notion of randomness must be understood relative to an a priori fixed probability distribution for the outcome values. Each probability distribution for the outcome values determines a specific set of random sequences, and a sequence which is random relative to a given probability distribution for the outcomes will not be random relative to a different probability distribution for those outcomes. Without an underlying fixed a priori probability distribution for the possible outcome values, the notion of a “random sequence” would not even make sense. It is thus useful to fix a specific probability distribution on a specific set of outcome values (a probability model) when discussing random sequences.

As mentioned earlier, in this article we will restrict our attention to the case where the underlying probability distribution for the blackbox is the fair coin model: only two equiprobable outcomes 0 and 1, with P(0) = P(1) = 1/2. We will then try to find out which sequences are random (i.e. investigate Question B of Section 1.2 and Problems 1 and 2 of Section 1.5) relative only to this fair coin model. This may at first appear to be too severe a restriction, but the extra generality obtained by considering more general probability distributions, while adding verbiage, would not provide much extra insight into the question of randomness. For the reader who is still worried about our restriction to the fair coin model alone, we now mention some technical results which show how this simple case can represent, at least for sequences, other more general distributions.
with n outcomes ω1 , ω2 , . . . , ωn , and Pn corresponding probabilities P (ωi ) = pi , i = 1, 2, . . . , n, where 0 < pi < 1 and i=1 pi = 1. Then the probability space P N of all infinite sequences of outcomes is still essentially identical to the Cantor space with Lebesgue measure. In particular, there is a measurable bijection between P N and {0, 1}N which preserves measure. This is a technical result on finite Borel measures on complete separable metric spaces. If p1 , p2 , . . . , pn are computable real numbers, then this bijection can be assumed to be an effective one. If the probabilities p1 , p2 , . . . , pn are dyadic rational numbers (as is the case, e.g., in computer representations of numbers), then one can use an especially simple effective coding by which any sequence (finite or infinite) of outcomes of the blackbox can be represented effectively by a corresponding binary sequence from the fair coin model, while still preserving probabilities of all events. As another example, we mention the so called von Neumann trick, a simple method by which one can “turn a biased coin into an unbiased one”, or, more precisely, simulate sequences of flips of a perfectly unbiased coin using sequences of flips of a biased coin (whose probability of heads differs from probability of tails). For every two consecutive flips, record the outcome as a single 0 if the (ordered) pair of flips is HT, record it as a single 1 if the pair of flips is TH, otherwise discard the pair (the cases HH and TT), and move to the next pair of flips of the coin. We will see later that Martin-L¨of’s definition of randomness is general and flexible enough to be applied directly to very general probability distributions. However, the technical results mentioned here show that there are mathematically sound ways to restrict our attention to the simplified case of the fair coin model, without imposing any serious limitation to the study of random sequences. We adopt this simplification, which equivalently means that the underlying sequence space will always be the Cantor Space with the (uniform) Lebesgue measure. 3
3 CLASSICAL STOCHASTIC RANDOMNESS IN INFINITE SEQUENCES
We now begin the discussion of the fundamental question stated in Problem 1: given an infinite binary sequence x = ⟨x_1, x_2, …, x_n, …⟩ ∈ {0, 1}^N, how do we determine if it is random?
3.1 Key points
We start by noting several key points of randomness in infinite binary sequences. Recall that the Cantor space with Lebesgue measure is the underlying probabilistic model throughout. Our observations here will be heuristic and informal but of fundamental importance.

(a) The probability that a sequence is random equals one, i.e., the random sequences form a set of measure one (a full-measure set). A basic intuition here is that the omission or addition of one single bit has no effect on the randomness of an infinite sequence: if x is the sequence x_1 x_2 x_3 …, and x′ is the
sequence x_2 x_3 x_4 … obtained by dropping only the first bit of x, then x′ is random if and only if x is random. It follows by mathematical induction that the randomness of an infinite sequence should depend only on its “eventual behavior”, and that no finite part can determine the randomness of the entire sequence.7 Thus if two infinite sequences x and y agree on all but a finite number of places (i.e. ∃m ∀n>m (x_n = y_n)), then the randomness of x is equivalent to the randomness of y. This implies that the set of random sequences satisfies the condition for the zero-one law, and therefore must be either a measure-zero set or a full-measure set. As pointed out by Laplace, it seems natural that the random sequences should form the vast majority of sequences: if we “randomly pick” a sequence in {0, 1}^N, or equivalently generate one by “randomly flipping” a fair coin infinitely many times, the probability that the result is random should be high, and so non-zero. It follows that the random sequences must form a set of measure one.

(b) Second, if the sequence of outcomes of a gambling wheel with two equiprobable outcomes is random, then no successful betting strategy can be devised against it. Note that here the bits of an infinite sequence x are thought of as generated by the independent turns of the gambling wheel, and the randomness of x would imply a strong form of unpredictability for the future bit values of x (from previously observed bit values). More precisely, suppose that a gambling house is generating the infinite binary sequence x and offering the following fair game: a gambler can bet an amount b of money predicting the next bit of x; if the prediction is correct, the gambler wins an amount b, otherwise he loses the same amount b. (This rule is tantamount to our underlying assumption of a fair coin model.) We say that the gambler is able to devise a successful gambling system against x, or beat the house against x, if by using a suitable strategy the gambler can start with a finite initial capital and win an arbitrarily large fortune without going bankrupt. By a strategy we mean a finitely specifiable rule which determines how much (and whether) to bet on a particular turn based on the outcomes of the previous turns. (These notions will be made more precise later in subsections 5.1 and 11.2.) From the point of view of the house, the randomness of the sequence x of outcomes must imply that the bits of x are so unpredictable that no gambler would be able to beat the house against x. We carefully note that this impossibility of a successful gambling system is a fundamental necessary condition for randomness of an infinite sequence (a fact first recognized by von Mises; see subsection 5.1 for the quoted comments of Feller).8
7. This is quite similar to the notion of the limit of an infinite numerical sequence found in elementary calculus: no amount of alteration of the values of any finite number of terms of a sequence can affect its limit.

8. Later, during the attempts in subsections 5.1 and 11.2 to find a suitable definition of randomness, things will be turned around to postulate this impossibility condition as also a sufficient condition for randomness.
(c) On the other hand, randomness cannot be identified with complete and absolute lawlessness. We may try to think of a sequence as random if it satisfies no law whatsoever (“absolutely lawless”). But, as pointed out by Calude in [Calude, 2000] (see also [Volchan, 2002]), no such sequence can exist, since every digit-sequence satisfies the following Ramsey-type combinatorial law first proved by van der Waerden [1927]: the positions (indices) for at least one digit-value will contain arbitrarily long arithmetic progressions. Thus we have to abandon such ideas of “complete lawlessness”.

(d) In fact, random sequences will necessarily satisfy certain limiting or stochastic properties. For example, given a sequence x = ⟨x_k⟩, let S_n[x] denote the number of 1s (or Successes) in the first n terms of the sequence x:

S_n[x] = Σ_{k=1}^{n} x_k = number of 1s in the first n terms of x,

so that the quantity (1/n)S_n[x] represents the proportion of 1s in the first n terms of x. This proportion is called the relative frequency, or simply the frequency, of successes. If for a sequence x this proportion exceeds 2/3 (say) infinitely often (i.e., (1/n)S_n[x] > 2/3 for infinitely many n), then x cannot be random (under the fair coin model), because a gambler would then be able to exploit this “bias within x” to devise a strategy which (starting with a finite initial capital) can return an arbitrarily large fortune without going bankrupt. More precisely, if x is to be random, so that no gambling system would be successful against it, then it can be shown that the following condition must hold:

For no positive number p should the proportion (1/n)S_n[x] exceed 1/2 + p (or fall below 1/2 − p) infinitely often.9

This condition on x is equivalent to requiring that the relative frequency (1/n)S_n[x] approach the value 1/2 in the limit as n approaches infinity. Thus, for every random x we have lim_{n→∞} (1/n)S_n[x] = 1/2. This exemplifies that in order to be random, a sequence, instead of being “totally lawless”, must actually satisfy certain stochastic laws of “unbiasedness”. In the rest of this section we will discuss further stochastic laws that should be satisfied by every random sequence, starting with the Borel strong law, whose basic condition is the same as the one in the above example.

9. If (1/n)S_n[x] exceeds 1/2 + p infinitely often for a positive p, then the gambler can beat the house by betting a constant fraction r of his remaining capital at each turn, predicting a bit outcome of 1, where r is a constant with 0 < r < p/(1 + p). We omit the details of the calculation.
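The strategy of footnote 9 can be watched in a quick simulation. Here the biased house sequence is a simplified i.i.d. stand-in for one whose frequency of 1s stays above 1/2 + p, and the particular numbers are arbitrary:

```python
import random

p, r = 0.1, 0.05            # bias margin and betting fraction; r < p/(1+p) ~ 0.0909
rng = random.Random(2)
capital = 1.0
for _ in range(10_000):
    bit = 1 if rng.random() < 0.5 + p else 0    # biased "house" bit
    stake = r * capital                          # always predict 1
    capital += stake if bit == 1 else -stake     # even-money payoff
print(capital)    # grows exponentially large with overwhelming probability
```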
3.2 The Borel Strong Law
In 1909, Émile Borel [1909] proved a remarkable fact about infinite binary sequences, which was later generalized by many mathematicians (from Cantelli to Kolmogorov) and became a fundamental result of probability theory called the Strong Law of Large Numbers. Borel established that with probability one, the proportion of 1s among the first n terms, (1/n)S_n, approaches the value 1/2 in the limit.

THEOREM 1 (The Borel strong law). For independent infinite sequences of flips of a fair coin, let B denote the event that the proportion of successes among the first n flips, (1/n)S_n, approaches the limit 1/2 as n → ∞, or formally, put

B = { x ∈ {0, 1}^N : lim_{n→∞} (1/n)S_n[x] = 1/2 }.

Then the probability of the event B is 1, i.e. the set B has Lebesgue measure 1.

We saw that due to the impossibility of successful betting strategies, a random sequence must satisfy the condition of the Borel strong law. That, together with our earlier observation that the random sequences form a set of measure one, provides an informal proof of the Borel strong law. (For a formal proof, see [Feller, 1968].)

Informally, the Borel strong law asserts that if we randomly pick a member x from {0, 1}^N, the probability is one that the relative frequency of 1s among initial parts of x approaches a limiting value called the limiting frequency, and this limiting frequency is equal to 1/2, so that “randomly picked sequences are unbiased”. Conversely, if for a sequence x this limiting frequency exists but is not equal to 1/2, then, in view of our underlying fair coin model, x would clearly be biased, not random. And if the limiting frequency does not exist, then either lim sup_n (1/n)S_n[x] > 1/2 or lim inf_n (1/n)S_n[x] < 1/2, and in either case arbitrarily large segments of x would be biased with arbitrarily great statistical significance, so x again would be non-random (successful gambling systems could be devised against x in these cases). Thus, it is natural to view this “stochastic law of unbiasedness” as a “stochastic law of randomness”.

Note that satisfying this law is only a basic necessary condition for being random. For example, the sequence 010101010101⋯ of alternating 0s and 1s satisfies the condition of the strong law, but this sequence is clearly not random. We therefore look for stronger stochastic laws which would exclude such simple examples.
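A quick empirical illustration of both halves of this observation — the law holds for coin flips, yet holding it is not sufficient for randomness — in Python:

```python
import random

n = 100_000
rng = random.Random(3)
coin = [rng.randint(0, 1) for _ in range(n)]
alternating = [k % 2 for k in range(n)]     # 0101... satisfies the law too

for name, x in (("coin", coin), ("alternating", alternating)):
    print(name, sum(x) / n)                 # both relative frequencies are ~0.5
```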
3.3 Borel Normality
Let σ be a fixed binary string with length |σ| = k (e.g., if σ = 00110101 then its length is |σ| = k = 8). If a fair coin is flipped k times, the probability of obtaining σ as the resulting outcome equals 1/2^k. For an infinite binary sequence x, consider a block of k bits starting at bit position n: this is the segment of x given by x_n x_{n+1} … x_{n+k−1}. It is not hard to see that with probability 1 the string σ
must occur infinitely many times in an infinite binary sequence.10 Adding this condition (that every finite binary string must occur in x with infinite frequency) as a stochastic law certainly excludes simple sequences satisfying the Borel strong law, such as 01010101⋯, as being random.

In [Borel, 1909], Borel had established an even stronger fact. Call an infinite binary sequence x Borel normal in base 2 if for each fixed binary string σ of length |σ| = k, the frequency of occurrences of σ among the first n bits of x (as a fraction of n) approaches 1/2^k as n → ∞. (For |σ| = k = 1, this reduces to the condition of the Borel strong law.) Borel proved that the probability that x is normal in base 2 is one, i.e. almost all infinite binary sequences are normal in base 2. Any infinite binary sequence satisfying this property will clearly be “much more random” than simple sequences like 01010101⋯.

An immediate consequence of Borel normality is another form of unpredictability for random sequences: the observation of any particular bit pattern in a random sequence does not influence the value of the next bit. More precisely, for almost all infinite binary sequences x and any fixed finite binary string σ (bit pattern), the limiting frequency with which σ is immediately followed by 0 equals the limiting frequency with which σ is followed by 1.

An explicit example of an infinite binary sequence which is normal in base 2 is the Champernowne binary sequence, obtained by concatenating the binary representations of the non-negative integers (taken in their natural order):

01101110010111011110001001101010111100110111101111⋯

As far as randomness is concerned, this is a big improvement over the simple 01010101⋯, but it is impossible to call this sequence random, since it too is generated by a simple effective procedure.

Recall that one can identify {0, 1}^N with the unit interval [0, 1] by viewing infinite binary sequences as binary expansions of reals in [0, 1] (after disregarding a negligible countable subset of {0, 1}^N). This identification preserves measurable sets and the measure of every such set. Moreover, given an integer base b > 1, every real x ∈ [0, 1] can be expanded in base b as:

x = Σ_{k=1}^{∞} x_k / b^k,   where ⟨x_k⟩ ∈ {0, 1, …, b − 1}^N.

The terms of the infinite sequence ⟨x_k⟩ ∈ {0, 1, …, b − 1}^N above are known as the b-ary digits of x; e.g., b = 2 gives the binary and b = 10 the decimal expansion.

10. Proof: Divide x into consecutive blocks of size k each. For any n, the probability that σ is not the n-th block equals the constant r = 1 − 1/2^k. So the probability that σ does not occur as one block in a run of m consecutive blocks is r^m. Since r < 1, r^m → 0 as m → ∞, and thus for any n the probability that σ does not occur after the n-th block is zero. It follows that the probability that σ occurs in only finitely many blocks is also zero. QED.
For x ∈ [0, 1] with base b expansion ⟨x_k⟩, we say that x is Borel normal in base b if for each fixed finite string σ ∈ {0, 1, …, b − 1}* of length |σ| = k, the frequency of occurrences of σ among the first n b-ary digits of x (as a fraction of n) approaches 1/b^k as n → ∞. Finally, x is absolutely normal if x is Borel normal in every base b > 1. Note that by the identification of {0, 1, …, b − 1}^N with [0, 1], these definitions apply to infinite sequences as well.

Two examples of numbers Borel normal in base 10 are the following reals, shown in decimal expansion:

0.12345678910111213⋯ (the decimal Champernowne number),
0.235711131719232931⋯ (the decimal Copeland-Erdős number).

For the first number above, the decimal digits after the decimal point are formed by concatenating the positive integers written in decimal notation (in their natural order), while the second one has decimal digits formed by concatenating the positive primes written in decimal notation. While both of these are Borel normal in base 10, it is not clear if they are Borel normal in any other base.

Borel’s result implies that almost all reals in [0, 1] (almost all infinite binary sequences) are actually absolutely normal. But it is harder to come up with examples of absolutely normal numbers, as they have an even higher degree of randomness compared to the Champernowne number or the Copeland-Erdős number. It is an old open question whether the number π (or √2, or e) is absolutely normal, or even normal in any base whatsoever. Following early work of Sierpinski [1917] (also Turing [1992]), Becher and Figueira [2002] have constructed absolutely normal numbers using an effective procedure (these are real numbers which are computable, although somewhat complicated to define).
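Block frequencies in the binary Champernowne sequence can be inspected directly. This Python sketch (the prefix length 100,000 is an arbitrary choice) shows the 1-bit and “11”-block proportions drifting toward the normal values 1/2 and 1/4:

```python
from itertools import count, islice

def champernowne_bits(n):
    """First n bits of the concatenation 0 1 10 11 100 101 110 111 1000 ..."""
    def gen():
        for k in count():
            yield from (int(c) for c in bin(k)[2:])
    return list(islice(gen(), n))

bits = champernowne_bits(100_000)
print(sum(bits) / len(bits))                        # slowly approaches 1/2
pairs_11 = sum(bits[i] == bits[i + 1] == 1 for i in range(len(bits) - 1))
print(pairs_11 / (len(bits) - 1))                   # tends to 1/4
```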
3.4 Laws of random walk (“wandering drunkard” laws)
One can identify the infinite binary sequences (or, equivalently, all possible outcomes of an infinite sequence of coin flips) with the set of all random walks. Consider the number line infinitely extended in both directions and indexed by the integers Z, and a person starting at 0 taking a sequence of steps determined by x ∈ {0, 1}^N as follows: the first step is one unit to the right (in the positive direction) if x_1 = 1 and one unit to the left if x_1 = 0; and the n-th step of the person is similarly determined by the value of x_n. If x is random, then this results in a random walk (sometimes called the drunkard’s walk).

If S_n[x] denotes the number of 1s (Successes) in the first n bits of x and F_n[x] = n − S_n[x] the number of 0s (Failures) in the first n bits of x, then the position of the person on the number line after step n is given by:

S_n[x] − F_n[x] = 2S_n[x] − n.

It is not hard to see that with probability 1, the person must move away an arbitrarily large distance to the right of the starting point, and also an arbitrarily
large distance to the left of the starting point.11 Formally, in terms of infinite binary sequences, the set

{ x ∈ {0, 1}^N : sup_n (S_n[x] − F_n[x]) = +∞ and inf_n (S_n[x] − F_n[x]) = −∞ }

must have measure (probability) one. In particular, this stochastic law implies that for a walk to be random, the person must oscillate about the origin with “arbitrarily large amplitudes”, and must “cross the origin” from right to left and from left to right infinitely many times:

THEOREM 2 (Law of Symmetric Oscillation in Random Walks). If x ∈ {0, 1}^N is random, then we must have:

S_n[x]/n > 1/2 for infinitely many n, as well as S_n[x]/n < 1/2 for infinitely many n.
Thus a walk in which the person eventually stays on one side of the origin (eventually to the right or eventually to the left) is “biased” and cannot be “random”.
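A simulation of the walk makes the oscillation visible, with the caveat that a single finite run can only illustrate, never verify, a probability-one statement about infinite sequences:

```python
import random

rng = random.Random(4)
position, returns, max_right, max_left = 0, 0, 0, 0
for _ in range(1_000_000):
    position += 1 if rng.randint(0, 1) == 1 else -1
    if position == 0:
        returns += 1                      # back at the origin again
    max_right = max(max_right, position)
    max_left = min(max_left, position)
print(returns, max_right, max_left)       # typically many returns, wide excursions
```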
3.5 The Law of Iterated Logarithms and Strong Normality
The Borel strong law says that (1/n)S_n converges to the expected value 1/2 with probability 1, but it does not say how the variance (or standard deviation) of S_n behaves asymptotically. A much more precise and stronger theorem (than the Borel strong law) is the Law of Iterated Logarithms, which gives an exact asymptotic bound on the deviation of S_n: with probability one, the values of S_n spread around the mean to an asymptotic distance of √(2 log log n) times the standard deviation. To state this more precisely, note that the mean of S_n is µ_n = n/2 and its standard deviation is σ_n = √n/2. The Law of Iterated Logarithms then asserts the following: for any λ > 1, the probability is one that S_n < µ_n + λ√(2 log log n) σ_n for all but finitely many n, and for λ < 1, the probability is one that S_n > µ_n + λ√(2 log log n) σ_n for infinitely many n; and similarly for the lower bounds of S_n. (See Feller [1968] for a proof.) The Law of Iterated Logarithms has both the Borel strong law and the Law of Symmetric Oscillation of 3.4 as immediate corollaries.

11. Proof (outline): Computing binomial probabilities, the probability that |S_n[x] − F_n[x]| ≤ k for all but finitely many n equals zero for each k, and hence the probability that the sequence ⟨S_n[x] − F_n[x] : n ∈ N⟩ is unbounded equals one. Now the events sup_n (S_n[x] − F_n[x]) = +∞ and inf_n (S_n[x] − F_n[x]) = −∞ are equiprobable and satisfy the conditions of the zero-one law. So either both events have probability zero, or both have probability one. But it is impossible for both these events to have probability zero, as that would imply boundedness of the sequence ⟨S_n[x] − F_n[x] : n ∈ N⟩ with probability one (a contradiction), and the result follows.
Belshaw and Borwein [2005] have used a slightly weaker version of the Law of Iterated Logarithms to define the notion of Strong Normality. They show that requiring a sequence to be strongly normal makes it more random than requiring it to be merely Borel normal. This is established both by graphical empirical evidence and by proving that the Champernowne binary sequence is not strongly normal. Since every sequence satisfying the Law of Iterated Logarithms is strongly normal, it follows that the Champernowne binary sequence violates the Law of Iterated Logarithms.
3.6 Equidistribution laws and the Ergodic Frequency Theorem
For any statement P, we use the notation ⟦P⟧ to denote the binary truth value of the statement P, i.e., ⟦P⟧ = 1 if P is true, and ⟦P⟧ = 0 if P is false.

Let ⟨x_n⟩ be an infinite sequence of real numbers in [0, 1). We say that ⟨x_n⟩ is equidistributed (or uniformly distributed) if for all 0 ≤ a < b ≤ 1,

lim_{n→∞} (1/n) Σ_{k=1}^{n} ⟦x_k ∈ [a, b)⟧ = measure of [a, b) (= b − a).

In other words, ⟨x_n⟩ is equidistributed if the limiting frequency with which x_n enters the interval [a, b) equals the size of [a, b). Similarly, a sequence ⟨x_n⟩ of infinite binary sequences (i.e. each x_n ∈ {0, 1}^N) is equidistributed if for each finite binary string σ,

lim_{n→∞} (1/n) Σ_{k=1}^{n} ⟦x_k ∈ N(σ)⟧ = measure of N(σ) = 1/2^{|σ|},
where N(σ) is the set of all infinite binary sequences having σ as an initial segment. Equidistribution can also be viewed as a form of unbiasedness: every subinterval asymptotically gets its “proper share” of the sequence.

A great deal of classical mathematical literature exists on equidistribution (see [Kuipers and Niederreiter, 1974]). We mention a theorem of Weyl: if ⟨a_n⟩ is a sequence of distinct integers, then the sequence ⟨FRAC(a_n x)⟩ is equidistributed for almost all x in the unit interval. (Here FRAC(x) denotes the fractional part of x ∈ R: FRAC(x) = x − ⌊x⌋, where ⌊x⌋ is the floor of x, or the greatest integer not greater than x.)

Borel normality can be viewed as a special case of the Weyl equidistribution theorem just mentioned. Taking a_n = 2^{n−1}, we see that ⟨FRAC(2^{n−1} x)⟩ is equidistributed over the unit interval for almost all x. Moving to {0, 1}^N, this implies that for almost all x = ⟨x_n⟩ ∈ {0, 1}^N, the sequences ⟨x_1, x_2, x_3, …⟩, ⟨x_2, x_3, x_4, …⟩, ⟨x_3, x_4, x_5, …⟩, etc., are equidistributed over {0, 1}^N.
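The classical special case a_n = n with a fixed irrational multiplier (the equidistribution theorem for irrational rotations) is easy to test empirically; in this Python sketch, α = √2 and the interval [0.2, 0.5) are arbitrary choices:

```python
import math

alpha = math.sqrt(2)
a, b, n = 0.2, 0.5, 1_000_000
hits = sum(1 for k in range(1, n + 1) if a <= (k * alpha) % 1 < b)
print(hits / n)    # close to the measure b - a = 0.3
```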
Let L : {0, 1}^N → {0, 1}^N denote the left-shift operator: L(⟨x_1, x_2, x_3, …⟩) = ⟨x_2, x_3, x_4, …⟩.
Then we may restate this equidistribution by saying that for almost all x ∈ {0, 1}^N, the sequence ⟨x, Lx, L²x, …⟩ is equidistributed. In other words, for every finite binary string σ,

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} ⟦L^k x ∈ N(σ)⟧ = measure of N(σ) = 1/2^{|σ|},
for almost all x ∈ {0, 1}^N. But this is precisely the statement that almost all infinite binary sequences are normal in base 2. The left-shift operator L is an example of an ergodic operator, and the last displayed equation is a special case of the Birkhoff Ergodic Theorem. In subsection 11.6, we further discuss how this approach can be used as a stochastic law for specifying randomness.
3.7 General probabilistic laws for specifying randomness
In general, suppose that a specific law L for sequences (i.e. a specific property L of sequences) is satisfied with probability one, i.e. satisfied by almost all sequences. We have seen many examples of such laws (“laws of large numbers”): the Borel strong law, Borel normality, Symmetric Oscillation, Iterated Logarithms, etc., form an increasingly stringent sequence of such laws. Let us call any such law a probabilistic law of randomness. Formally, a probabilistic law of randomness, or simply a randomness law, is an explicitly defined set L of binary sequences having full measure. Given such a law L, the probability that a random sequence satisfies it is 1, so we can take L to be another law that random sequences must obey. By specifying more and more stringent randomness laws, we can try to verify that a given sequence is sufficiently random in this sense. In fact, any algorithm for generating pseudo-random sequences needs to be subjected to a series of such theoretical stochastic tests12 to estimate the “quality of randomness” of the sequences generated by the algorithm. After a sufficient number of such stages of refinement, we can hope to arrive at the “right” definition of randomness for infinite sequences — a definition which best matches our intuition. This is the approach taken by Knuth in his fascinating article What Is a Random Sequence? (Section 3.5 in [Knuth, 1998]). Knuth gives a series of definitions, R1–R6, and claims (or hopes, as he clarifies in a later footnote) that the final refinement (definition R6) is an appropriate definition of randomness.
12. This should not be confused with statistical tests for randomness in a finite sequence of digits. As mentioned earlier, for an infinite sequence, no finite part can determine if the entire sequence is random. Statistical tests, in particular the χ²-tests, are important for estimating “randomness confidence” for a finite set of data. See Knuth [1998] for a list of statistical tests of randomness, and p. 80 of [Knuth, 1998] for theoretical tests. John Walker distributes a suite of statistical tests for randomness.
3.8 Is absolute probabilistic randomness possible?
What happens if we define a sequence to be random if it satisfies all explicitly defined probabilistic randomness laws? Provisionally, let us call such a sequence absolutely random. Unfortunately, such a definition causes problems.

First, we have an apparent paradox: for an arbitrary binary sequence y, consider the randomness law L_y := {x ∈ {0, 1}^N : x ≠ y}. If x satisfies all randomness laws, then x ∈ L_y for all y ∈ {0, 1}^N, but this implies x ∈ L_x, so x ≠ x, a contradiction! It follows that no binary sequence can satisfy all randomness laws! However, this is not a real paradox. Since every law can be written using a finite sequence of symbols from a finite alphabet, there are at most countably many laws that can be explicitly defined. For all but countably many members y ∈ {0, 1}^N, the law L_y cannot even be explicitly stated, and hence will not count.

But, second, we run into a more serious technical metamathematical problem: the notion of “all randomness laws”, which makes good intuitive sense, is not formalizable in the standard system of axiomatic mathematics (ZFC). The reason for this is that the notion of a randomness law really refers to a definable subset of {0, 1}^N of full measure, but the notion of definability itself can only be defined in terms of satisfaction, or truth. By a classic result of Tarski, it is impossible to formalize the notion of truth in a formal system within the system itself.

One way out of this metamathematical difficulty is to not talk about the entire class of all definable subsets of {0, 1}^N, but to restrict ourselves to those which can be defined using formulas with a limited number of quantifiers. By restricting the number and scope of the quantifiers in the defining formulas, various hierarchies of definable subsets of {0, 1}^N are obtained, such as the arithmetical hierarchy (scope of quantifiers limited to the set of natural numbers) and the analytical hierarchy (scope of quantifiers limited to natural and real numbers). This means that we must be satisfied with relative degrees of randomness, and Absolute Randomness, like truth in formal systems, must remain elusive forever. This is the approach taken by modern research.

Very roughly, the n-th level of the arithmetical hierarchy consists of sets defined by formulas with n alternating quantifiers ranging over natural numbers. For each n, a notion of n-randomness can be defined appropriately. The higher the value of n, the stronger is the randomness. The lowest level of this hierarchy, which defines 1-randomness by suitably considering algorithmic randomness laws, is especially important. It captures the random sequences by defining them as the ones which satisfy all “algorithmic stochastic laws converging algorithmically.” But we first need to rigorously define the concept of algorithm to talk about these notions precisely.
4 ALGORITHMS AND POST MACHINES
Like randomness, the intuitive notion of algorithm (or effective computation) was not easy to capture. During the early part of the 20th century, in response to
Hilbert’s program, there was a great deal of effort by mathematicians to come up with a precise definition of the term “algorithm”. During the 1930s, which began with Gödel’s celebrated negative answer to Hilbert’s program, several mathematicians independently came up with such definitions. These included, in addition to Gödel, Herbrand, Church, Kleene, Post, and of course, Turing. After setting aside more restrictive definitions (such as those now known as primitive recursive computation), mathematicians converged on a definition that Church (in 1936) announced as the appropriate definition of algorithm. Church’s assertion is known as Church’s Thesis or the Church-Turing thesis. This is the definition accepted today, and the Church-Turing thesis has turned out to be highly successful. Most remarkably, almost all of the independent definitions coincided and produced the same characterization of the notion of algorithm! We present this definition as a variant of Post’s original version (see also [Uspensky, 1983; Davis, 1980]).
4.1 Post machines and programs
A Post Machine consists of a control part containing a program (a finite list of instructions described below), a bidirectional infinite tape of memory bits, and a head (shown as ⇑) which at any time is located at some unique bit position (the “current bit”) of the tape.

[Figure: The Post Machine. A bidirectional tape of bits (memory) with the head ⇑ at the current bit, and a program (control) such as: RIGHT, FLIP, IF1GOTO:0.]

The head can read the bit at the current position as either 0 or 1 and report this value to the control, and at the direction of the control it can either change the bit value at the current position, or move one bit position to the left, or one position to the right. The value of every tape bit is always either 0 or 1 (there are no blank symbols).

Instruction   Code   Function
FLIP          0      Complement current bit: 0 becomes 1 and 1 becomes 0
LEFT          1      Move the head one bit position to the left
RIGHT         2      Move the head one bit position to the right
IF1GOTO:n     n+3    Conditional jump: If current bit = 1, GO TO instruction # n

Table 1. Post Machine instruction set and their numeric encodings
The Post Machine has four13 types of instructions, as listed in Table 1. A Post Machine program is simply a finite list of instructions as given in Table 1. Equivalently, since every instruction corresponds uniquely to a natural number code in Table 1, a program can be defined as a finite sequence of natural numbers. In addition, every instruction in the program is assumed to be labeled serially by its “line number”, starting with 0 for the first instruction. Program execution starts at line number 0 and sequentially proceeds to the following line, except possibly for the IF1GOTO:n instruction. There is no explicit STOP or HALT instruction, and the program terminates whenever it is unable to execute the “next instruction”. Here is an example program and its code as a finite sequence of natural numbers:

0  RIGHT
1  IF1GOTO:0
2  FLIP
3  LEFT
4  IF1GOTO:3

= ⟨2, 3, 0, 1, 6⟩.
We encode natural numbers as strings by the correspondence n ↔ 1^{n+1}0:

0 ↔ 10,  1 ↔ 110,  2 ↔ 1110,  3 ↔ 11110, . . .

Given a program P and a positive integer k, we will now define a partial function ϕ of k natural number arguments. To find ϕ(n1, . . . , nk), we start the program P with the input string

1^{n1+1}0 1^{n2+1}0 . . . 1^{nk+1}0

on the tape, with the head placed at the first bit of this string (all other tape bits are 0).
There are now two possibilities, and we define ϕ(n1, . . . , nk) accordingly:

Case 1. The program P terminates when started as above. We then define ϕ(n1, . . . , nk) = m, where m is the length of the run of 1s to the right of the head, before the first 0 to the right of the head. (We express the situation in this case by saying that the program P halts on the inputs n1, . . . , nk with output m.)

Case 2. The program P started as above does not terminate (“loops forever”). In this case we leave ϕ(n1, . . . , nk) undefined.

13 One can combine the first two instructions into a single one to obtain an adequate version of the Post Machine with only three types of instructions.
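These conventions are easy to make concrete. The following is a minimal sketch of a Post Machine simulator (Python; the helper names run_post_machine, unary_input, and read_output are ours, not from the original text). It follows the instruction codes of Table 1 and the input/output conventions just described.

```python
def run_post_machine(program, tape, head=0, max_steps=10**6):
    """Simulate a Post Machine. program: list of numeric codes per Table 1
    (0 = FLIP, 1 = LEFT, 2 = RIGHT, n+3 = IF1GOTO:n). tape: dict from bit
    positions to 0/1 (missing positions are 0). Returns (tape, head) when
    control falls outside the program (termination)."""
    line = 0
    for _ in range(max_steps):
        if not (0 <= line < len(program)):      # no "next instruction": halt
            return tape, head
        code = program[line]
        if code == 0:                           # FLIP
            tape[head] = 1 - tape.get(head, 0); line += 1
        elif code == 1:                         # LEFT
            head -= 1; line += 1
        elif code == 2:                         # RIGHT
            head += 1; line += 1
        else:                                   # IF1GOTO:n with n = code - 3
            line = code - 3 if tape.get(head, 0) == 1 else line + 1
    raise RuntimeError("no termination within max_steps")

def unary_input(*args):
    """Lay out the input string 1^{n1+1}0 ... 1^{nk+1}0 from position 0."""
    bits, pos = {}, 0
    for n in args:
        for _ in range(n + 1):
            bits[pos] = 1; pos += 1
        bits[pos] = 0; pos += 1
    return bits

def read_output(tape, head):
    """Length of the run of 1s to the right of the head (Case 1 convention)."""
    m = 0
    while tape.get(head + 1 + m, 0) == 1:
        m += 1
    return m

tape, head = run_post_machine([2, 3, 0, 1, 6], unary_input(5))
print(read_output(tape, head))   # 7
```

Running the example program ⟨2, 3, 0, 1, 6⟩ this way on input 5 halts with output 7, matching the claim below that it computes the 1-ary function f(n) = n + 2.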
Thus, for every program P and positive integer k, there is a unique k-ary partial function ϕ determined as above, and we express this by saying that ϕ is the k-ary partial function computed by P, or that P computes the k-ary partial function ϕ. For example, the example program above computes the 1-ary function f(n) = n + 2, but it also computes the 2-ary function f(m, n) = m + n + 3. As simpler examples, note that the single-line program with the only instruction LEFT computes the (1-ary) successor function s(n) = n + 1, and the empty program computes the identity function f(n) = n. We can now formally define the notion of “algorithm” or effective computability:

DEFINITION 3. A k-ary partial function f = f(n1, n2, . . . , nk) is effectively partial computable or simply partial computable if there is a Post machine program P such that P computes f. If f is total, we say that f is effectively computable, or simply computable. A subset R ⊆ N^k (i.e., a k-ary relation R, which is simply a set of natural numbers if k = 1) is called effectively computable or algorithmically decidable or simply computable if its characteristic function is a computable function. R ⊆ N^k is called computably enumerable (or c.e.) if it equals the domain of some k-ary partial computable function. (In older literature the term “recursive” is used in place of “computable”.)

It can be shown that a set A is c.e. iff it equals the range of a partial computable function f. If the set A is non-empty, the function f can be assumed to be total. This explains the terminology: A set E is computably enumerable if its elements can be enumerated by a computable function f, as in E = {f(1), f(2), f(3), . . . }. Another important fact is that a set is computable iff both the set and its complement are c.e.

Most importantly, we want to note that the notions of effectively computable functions, computable sets, c.e. sets, etc., are all independent of the particular model of computation. In particular, if a function can be computed by some other computer, however powerful, it will also be computable by some Post machine program. Since there are only countably many programs but uncountably many sets and functions of natural numbers, it follows that most functions are not computable and most sets are not decidable (not even c.e.). We will see some specific examples soon.
4.2 Gödel numbering Post machine programs
Since a program P is a finite sequence of natural numbers, say P = ⟨p1, p2, . . . , pm⟩, each program is easily coded into a single natural number in an effective manner. One way to do this would be to build an integer e(P) from P = ⟨p1, p2, . . . , pm⟩ in binary notation as follows: Write 1 followed by pm zeros, followed by another 1 and pm−1 more zeros, and so on, ending with 1 followed by p1 zeros. Finally,
convert this binary string into an integer in the usual way. In other words,

e(P) = 2^{p1} + 2^{p1+p2+1} + 2^{p1+p2+p3+2} + · · · + 2^{p1+p2+···+pm+(m−1)}.

e(P) is called the Gödel number of P. For example, the Gödel number of the example program P = ⟨2, 3, 0, 1, 6⟩ is 66244 (in decimal). Note that the Gödel number of the empty program (which computes the identity function) is 0. There is an equally effective procedure to convert any natural number to the corresponding Post machine program. For example, given the integer 140 (in decimal), we first write it in binary as 10001100, and then read off the number of consecutive zeros after each 1, starting from the rightmost 1 and moving to the left. This gives us the sequence ⟨2, 0, 3⟩, which decodes into the program

⟨2, 0, 3⟩ =
0  RIGHT
1  FLIP
2  IF1GOTO:0
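Both directions of this coding are straightforward to implement. Here is a sketch (Python; the function names godel_number and decode_program are ours, not from the original text) that reproduces the two worked examples above.

```python
def godel_number(program):
    """Goedel number e(P): the binary string built as 1 followed by p_m
    zeros, then 1 followed by p_{m-1} zeros, ..., ending with 1 followed
    by p_1 zeros, read as an integer."""
    bits = "".join("1" + "0" * p for p in reversed(program))
    return int(bits, 2) if bits else 0     # the empty program encodes to 0

def decode_program(e):
    """Inverse: read off the runs of zeros after each 1, rightmost first."""
    if e == 0:
        return []
    program, zeros = [], 0
    for b in reversed(bin(e)[2:]):         # scan from the rightmost bit
        if b == "0":
            zeros += 1
        else:                              # each 1 closes a run of zeros
            program.append(zeros); zeros = 0
    return program

assert godel_number([2, 3, 0, 1, 6]) == 66244
assert decode_program(140) == [2, 0, 3]
```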
It is another remarkable fact that there is a Post machine program U which, given a finite sequence of numbers as input arguments, treats the first argument as the Gödel number of a program P, and then (by decoding the first argument) is able to simulate P on the remaining arguments. Such programs U are called universal.

THEOREM 4 Universal Programs and Computable Functions. There is a (universal) Post machine program U which computes a partial function Φ(m, n) of two variables with the following property: For every Post machine program P with Gödel number e = e(P), the 1-ary function ϕ computed by P equals the function Φe defined by Φe(n) = Φ(e, n), i.e., ϕ = Φe. In particular, every single-variable partial computable function f equals Φe for some e.

Now put We = dom Φe, and recall that a set is called c.e. if it equals the domain of some partial computable function. Since the sequence ⟨Φe : e ∈ N⟩ contains all partial computable functions, the sequence ⟨We : e ∈ N⟩ forms a list of all c.e. sets. We define the special set ∅′ as ∅′ = {e : e ∈ We}. This is an example of a c.e. set whose complement is not c.e. To see this, suppose the complement of ∅′ is a c.e. set, so that for some e we have n ∈ We ⇐⇒ n ∉ ∅′ for all n. Taking n = e we get e ∈ We ⇐⇒ e ∉ ∅′ ⇐⇒ e ∉ We, a contradiction. It follows that ∅′ is not computable.

The domain of the function Φ is called HALT, since Φ(m, n) is defined iff the program with Gödel number m halts on input n. HALT is a c.e. set by definition, with

n ∈ ∅′ ⇐⇒ (n, n) ∈ HALT, for all n.
Therefore, if HALT were computable, ∅′ would also be computable, which we have seen to be false. Hence HALT itself is non-computable. This is often expressed by saying that the Halting Problem is uncomputable.
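That HALT is c.e. can be made concrete by the standard dovetailing idea: run every (program, input) pair for more and more steps, and emit a pair as soon as it is seen to halt. The sketch below (Python, reusing the hypothetical run_post_machine, unary_input, and decode_program helpers from the earlier sketches) eventually lists every member of HALT and never lists a non-member.

```python
from itertools import count

def enumerate_halt():
    """Dovetailed enumeration of HALT = {(m, n) : program m halts on n}."""
    seen = set()
    for bound in count(1):                     # ever-increasing step bounds
        for m in range(bound):
            for n in range(bound):
                if (m, n) in seen:
                    continue
                try:
                    run_post_machine(decode_program(m), unary_input(n),
                                     max_steps=bound)
                    seen.add((m, n))
                    yield (m, n)               # seen to halt within bound
                except RuntimeError:
                    pass                       # retry with a larger bound
```

No decision procedure can replace this one-way enumeration: by the argument above, a program that always correctly answered “halts or not” would make ∅′ computable.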
We thus have examples of c.e. sets, HALT and ∅′, which are not computable, and their characteristic functions are examples of non-computable functions.
4.3 Computation on strings
We have been using numbers for computability notions. We now fix an effective one-to-one correspondence between the natural numbers and the binary strings so that computability notions can be extended to strings.

DEFINITION 5. We fix the following one-to-one correspondence between the natural numbers and the binary strings:

0 ↔ Λ, 1 ↔ 0, 2 ↔ 1, 3 ↔ 00, 4 ↔ 01, 5 ↔ 10, 6 ↔ 11, 7 ↔ 000, . . .

Given a number n, the corresponding binary string is denoted by str(n). Given a binary string σ, the corresponding number is denoted by num(σ).

This correspondence is effective: An algorithm for converting n to str(n) is: “write n + 1 in binary notation (without leading zero) and then drop (erase) the leading 1”. An algorithm for converting σ to num(σ) is: “prefix σ with an additional 1, regard the resulting string as an integer m written in binary notation, and take the final result to be m − 1.”

With this effective correspondence, we can extend every computability notion for numbers into one for strings, by converting (translating) between the two types back and forth as needed. In particular, we will say “the program P halts on input string δ with output string σ” to really mean “the program P halts on input num(δ) with output num(σ)”. Another example: We will say that a function f : {0, 1}* → {0, 1}* is effectively computable to really mean that the function g : N → N defined by the “translations” g = num ∘ f ∘ str (i.e., g(n) = num(f(str(n)))) is effectively computable. We will say that a set of strings is c.e. if the corresponding set of numbers is c.e. We could just as easily define computability notions for functions from numbers to strings and vice versa, and in general for subsets of N^m × ({0, 1}*)^n.
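The two conversion algorithms are one-liners. Here is a sketch (Python; the names str_ and num are ours, with str_ chosen to avoid shadowing the built-in) verifying the correspondence of Definition 5.

```python
def str_(n):
    """str(n): write n + 1 in binary and drop the leading 1."""
    return bin(n + 1)[3:]          # bin() yields '0b1...', so skip 3 chars

def num(sigma):
    """num(sigma): prefix a 1, read as binary, subtract 1."""
    return int("1" + sigma, 2) - 1

# The start of the correspondence: 0 <-> Lambda (empty), 1 <-> 0, 2 <-> 1, ...
assert [str_(n) for n in range(8)] == ["", "0", "1", "00", "01", "10", "11", "000"]
assert all(num(str_(n)) == n for n in range(1000))
```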
4.4 Effective topological notions
The idea of effective computability has been extended to topological notions and is used extensively in an area called descriptive set theory. Many classical notions of analysis get refined this way. Here we describe two such notions, effective open and uniformly effective open. Recall that an open set G in the Cantor space is a union of basic intervals, G = ⋃_{σ∈S} N(σ) for some set S of strings. We refine this definition by requiring the index set over which the union is taken to be computably enumerable:
DEFINITION 6 Effective Open Sets. A subset G of {0, 1}^N is called effective open or computably enumerable iff there is a c.e. set S of strings such that

G = ⋃_{σ∈S} N(σ).

A c.e. set of strings is one whose members can be listed by a program (or a partial computable function), so a set G is effective open iff there is a program which prints a list of basic intervals whose union equals G. More precisely, G is effective open iff there is a partial computable function σ : N → {0, 1}* such that

G = ⋃_{n=1}^{∞} N(σ(n)),
where we take N(σ(n)) = ∅ if σ(n) is not defined. Finally, we define what it means to say that the sets G1, G2, G3, . . . are uniformly effective open. To be uniformly effective open, it is not enough that each set Gn in the sequence be individually effective open; the entire sequence must be listed together in an effective way, i.e., there should be a single program listing a double-sequence of basic intervals whose unions form these sets. More precisely:

DEFINITION 7 Uniformly Effective Open Sets. A sequence of sets G1, G2, . . . is uniformly effective open if there is a partial computable function σ : N × N → {0, 1}* such that

Gn = ⋃_{m=1}^{∞} N(σ(m, n)),
where we take N (σ(m, n)) = ∅ if σ(m, n) is undefined.
5 VON MISES’ DEFINITION OF RANDOM SEQUENCE
Undoubtedly, one of the most successful achievements of twentieth century mathematics was the measure-theoretic axiomatization of probability. Proposed by Kolmogorov in 1933, it soon became the essential basis for the study of mathematical probability theory. The generality and elegance of this abstract axiomatic approach — a hallmark of modern formalist mathematics — found wide applicability in almost every situation in probability theory. However, this method with its formalist nature does not directly involve the notion of randomness in any fundamental way, and the rather intuitive problem of defining randomness did not arise in its course of development.

Richard von Mises was the first person to focus clearly and deeply on the mathematical essence of randomness in sequences. His pioneering work early in the twentieth century on developing a frequentist theory of probability [Von Mises, 1919; Von Mises, 1981] already shed considerable light on the heart of the matter of
randomness, as Kolmogorov himself has remarked [Li and Vitanyi, 2008, p.50]. For von Mises, random sequences, which he called “collectives”, formed the essential basis of his frequentist theory. Subsequently, his work led other mathematicians to carry out further investigations that have clarified the notion of randomness significantly. The far-reaching implications of his work are still deeply influencing current research on randomness. Kolmogorov is known as the father of modern mathematical probability theory, and it is perhaps appropriate to call von Mises the founder of the modern mathematical theory of randomness in infinite sequences. Here we only briefly outline the central ideas of von Mises concerning random sequences. For more details, see his own works [Von Mises, 1919; Von Mises, 1981] and the works of van Lambalgen [1987a; 1987b; 1990; 1996].
5.1 Von Mises’ definition
The condition of the Borel strong law can be stated as saying that the relative frequency of successes in initial parts of the infinite binary sequence under consideration should have a limiting frequency of 1/2. Recall that this is a condition for unbiasedness (subsection 3.2). The fundamental intuition of von Mises is often summarized as the invariance of limiting frequency under (admissible) place selections. To understand what this means, consider an infinite sequence of turns of a gambling wheel — turn 1, turn 2, and so on — with each turn having two equiprobable outcomes 0 or 1. Let the sequence x = ⟨xn⟩ denote the outcomes of the turns (if the n-th turn of the wheel produces a 0, then xn = 0, else xn = 1). Suppose that a gambler is observing the outcomes of the turns, and before every turn the gambler decides whether to bet on that turn or not, perhaps basing the decision on the finite past history of earlier outcomes of the turns.

Example 1: The gambler chooses to bet on every third turn (turns 3, 6, 9, etc.), disregarding the earlier history of outcomes altogether (the “lucky-third” rule).

Example 2: The gambler chooses to bet after any run of five consecutive 0s.

In any case, bets may not get placed on every turn, but rather on selected turns determined by the gambler’s strategy. This results in a subsequence of turns selected for placing bets, say turns n1 < n2 < · · · < nk < . . . , as shown below:

Turn:    1     2     · · ·   n1 − 1   n1    n1 + 1   n1 + 2   · · ·   n2 − 1   n2    · · ·
Action:  skip  skip  · · ·   skip     bet   skip     skip     · · ·   skip     bet   · · ·
Using the strategy, the gambler selects turn n1 for placing his first bet, turn n2 for the second bet, etc. In the first example, the subsequence of selected turns for betting is 3 < 6 < 9 < . . . , independent of the outcome of the turns. In the second example, the turns on which the gambler bets will depend on the particular infinite sequence of turn outcomes. Once the turns for betting are all selected, suppose we restrict to the outcome values at only these selected turns — discarding (erasing) the outcome values of those turns on which bets are not placed — and then compute the limiting
frequency for this new restricted subsequence of outcomes. According to von Mises, if the original sequence of outcomes were random, then this new limiting frequency would still equal 1/2, regardless of the gambling strategy being used! Moreover, this crucial property is the essence of randomness, and therefore characterizes it:

DEFINITION 8 Von Mises Randomness, Initial Version. An infinite binary sequence x ∈ {0, 1}^N is random if, whatever the gambler’s strategy and the resulting turns n1 < n2 < . . . selected for placing bets, the subsequence of x obtained by restricting to these selected turns still has limiting frequency 1/2, i.e.:

lim_{m→∞} (1/m) ∑_{k=1}^{m} x_{n_k} = 1/2.
In von Mises’ terminology, a “turn” (on which the gambler may or may not bet) is called a place, and a “strategy” by which the gambler selects which turns to bet on, is called a place selection rule. The “invariance of limiting frequency under admissible place selections” can now be understood as a form of unpredictability arising from unbiasedness: No betting strategy of place selections can succeed by improving predictability within a random sequence, since such selections will leave unbiasedness intact (identical limiting frequency for the resulting subsequence). In other words, not only is the entire sequence unbiased (limiting frequency of 1/2), but there is no hidden biased or unstable subsequence that can be found (by a gambler) using a suitable strategy of place selection.14 We quote some relevant remarks of Feller: “The painful experience of many gamblers have taught us the lesson that no system of betting is successful in improving the gambler’s chances.” [Feller, 1968, VIII.2, p. 198] . . . “[U]nder any system the successive bets form a sequence of Bernoulli trials with unchanged probability of success. . . . The importance of this statement was first recognized by von Mises, who introduced the impossibility of a successful gambling system as a fundamental axiom.” [Feller, 1968, VIII.2, p. 199] “Taken in conjunction with [the] theorem on impossibility of gambling systems, the law of the large numbers implies the existence of the [limiting frequency] not only for the original sequence of trials but also for all subsequences obtained in accordance with the rules of [place selection].” [Feller, 1968, VIII.4, p. 204] 14 For simplicity we are restricting (by using the von Neumann trick if necessary) only to the special limiting frequency value of 1/2 instead of the general value p (0 < p < 1) used by von Mises. For us, this does not cause much loss of generality as we are focusing only on randomness. Von Mises’ chief objective was to develop the frequentist theory of probability.
5.2 Mises-Wald-Church randomness
We first formalize the earlier definition of von Mises Randomness. A place selection rule or betting strategy is a partial function ϕ : {0, 1}* → {0, 1}. (It tells the gambler when to bet: For a binary sequence ⟨x1, x2, . . . , xn, . . .⟩ of outcomes, the n-th turn is selected for betting according to the strategy ϕ if and only if ϕ(⟨x1, x2, . . . , xn−1⟩) = 1.) Given a place selection rule ϕ and x ∈ {0, 1}^N such that ϕ(⟨x1, x2, . . . , xn−1⟩) = 1 for infinitely many n, let n1 be the least n for which ϕ(⟨x1, x2, . . . , xn−1⟩) = 1, n2 be the next such n, etc. Then the sequence ⟨x_{n_1}, x_{n_2}, . . . , x_{n_k}, . . .⟩ is called the ϕ-selected part of x. (It is the subsequence of x obtained by restricting x to those indexes which are selected for betting according to ϕ.) Thus we say that the ϕ-selected part of x has limiting frequency 1/2 iff

lim_{m→∞} (1/m) ∑_{k=1}^{m} x_{n_k} = 1/2.
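For a finite prefix of outcomes, a place selection rule and the frequency of its selected part are easy to compute. The sketch below (Python; the helper names are ours, not from the original text) implements the two example rules above and estimates the frequency of the selected part of a simulated fair-coin sequence.

```python
import random

def selected_part(x, phi):
    """phi-selected part of the finite outcome list x: position n (0-based,
    i.e. turn n+1) is selected iff phi applied to the history x[:n] is 1."""
    return [x[n] for n in range(len(x)) if phi(x[:n]) == 1]

# Example 1: the "lucky-third" rule bets on turns 3, 6, 9, ...
lucky_third = lambda history: 1 if (len(history) + 1) % 3 == 0 else 0

# Example 2: bet after any run of five consecutive 0s.
after_five_zeros = lambda history: 1 if history[-5:] == [0, 0, 0, 0, 0] else 0

x = [random.randint(0, 1) for _ in range(10_000)]
for rule in (lucky_third, after_five_zeros):
    sub = selected_part(x, rule)
    print(len(sub), sum(sub) / len(sub))   # frequency near 1/2 for typical x
```

For a “typical” simulated x both rules yield a selected part with frequency close to 1/2, exactly as von Mises’ intuition predicts.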
We can now try to restate the definition of von Mises randomness as follows. A sequence x ∈ {0, 1}^N is random iff for all place selection rules ϕ for which ϕ(⟨x1, x2, . . . , xn−1⟩) = 1 for infinitely many n, the ϕ-selected part of x has limiting frequency 1/2.

We run into a problem with this definition with its unrestricted use of the universal quantifier in the clause “for all place selection rules”. Given any sequence x ∈ {0, 1}^N, put A = {n : xn = 1} if this set is infinite, otherwise put A = {n : xn = 0}. Define ϕ by the condition that ϕ(⟨y1, . . . , yn−1⟩) = 1 if n ∈ A, and = 0 otherwise. It is easy to see that the ϕ-selected part of x has a limiting frequency equaling either 0 or 1, violating randomness. It follows that no sequence is random! However, as von Mises points out, this is not a real problem. The defect of the argument is that the rule ϕ used in the argument selects a place n based on the outcome value xn, and such rules are of course not allowed [Von Mises, 1981, p. 25]. The place selection rules in the definition of randomness are restricted to only certain admissible rules, instead of being completely arbitrary, and the problem is resolved.15 Here is the corrected definition in its original intended form:

DEFINITION 9 Von Mises Randomness. A sequence x ∈ {0, 1}^N is random iff for all admissible place selection rules ϕ for which ϕ(⟨x1, x2, . . . , xn−1⟩) = 1 for infinitely many n, the ϕ-selected part of x has limiting frequency 1/2.

15 From certain sections of von Mises’ detailed discussion of the concept [Von Mises, 1981], it is also clear that he wants a place selection rule, or betting strategy, to be “specifiable in some effective manner”. In the modern language of mathematical logic, that could be interpreted as some notion of effective description (say, as being effectively computable, or at least being definable in some definite language), but von Mises does not precisely specify any such rigorous criterion for defining “admissible”.
Abraham Wald [1936] then showed that whenever the set of admissible place selection rules is countable, random sequences according to the von Mises definition do exist, and form a set of full-measure. This implies that if “admissible” is taken to mean any form of effective specifiability using finite sequences of symbols from a finite alphabet (such as effective computability), then the set of admissible rules remains countable, and so von Mises random sequences would exist. In 1940, Alonzo Church [1940] proposed the use of effectively computable (total) place selection rules for precisely defining von Mises randomness. Such random sequences are now said to be Church random or Church stochastic. Using partial computable place selection rules, we have the following definition:

DEFINITION 10 Mises-Wald-Church Randomness. A sequence x ∈ {0, 1}^N is Mises-Wald-Church random iff for all partial computable place selection rules ϕ for which ϕ(⟨x1, x2, . . . , xn−1⟩) = 1 for infinitely many n, the limiting frequency of x under ϕ-selection is 1/2.

Thus, in 1940, the first precise and rigorous mathematical definition of randomness for infinite sequences was found. In recent literature, the term stochastic is used for randomness defined using limiting frequency after place selections, and so Mises-Wald-Church random sequences are now also called Mises-Wald-Church Stochastic.

As explained in Section 3, any notion of randomness must be subjected to the fundamental stochastic laws such as the Borel strong law, Borel normality, Symmetric Oscillation, etc. It is easy to see that every Mises-Wald-Church random sequence satisfies the condition of the Borel strong law, since the place selection rule ϕ defined by ϕ(σ) = 1 for all σ is computable, and the resulting subsequence is simply the entire original sequence. It can also be shown that Mises-Wald-Church random sequences are Borel normal. But a big blow to the definition came when Ville [1939] proved that there are Mises-Wald-Church random sequences which do not satisfy the Law of Symmetric Oscillations: For certain Mises-Wald-Church random sequences x, the relative frequency satisfies (1/n) Sn[x] > 1/2 for all n. In terms of the random walk, this means the position of the walking person always stays to the right of the origin, a violation of the Law of Symmetric Oscillations.

The definition of Mises-Wald-Church randomness can be viewed as the impossibility of any successful algorithmic betting strategy of place selections. Unfortunately, Ville’s result shows that this condition is not sufficient to guarantee the randomness of a sequence (recall from subsection 3.1 that the condition must be necessary for randomness). The method outlined in subsection 3.8 of capturing the random sequences using “effective stochastic laws converging effectively” has turned out to be more
successful, and we discuss it in Section 6. If, instead of considering betting strategies of place selection and the resulting limiting frequency, we consider capital betting strategies (martingales), then the corresponding analog of the Mises-Wald-Church definition — namely, the impossibility of any successful suitably algorithmic capital betting strategy — turns out to be better behaved (see Section 11).
6 MARTIN-LÖF AND SOLOVAY RANDOMNESS

6.1 Martin-Löf randomness
Subsection 3.8 outlined the program of defining a sequence x to be random if it satisfies all “effective probabilistic laws of randomness”, where an “effective probabilistic randomness law” is simply an “effective full-measure” set (or, what we called a “typical property” in subsection 1.6). Going to complements, this means that x should not belong to any “effective measure-zero” set (or, in the language of subsection 1.6, that x should not have any “special property”). The question now, therefore, is how to precisely define “effective measure-zero”. Twenty-five years after the Mises-Wald-Church definition, in 1965, a satisfactory solution to this crucial problem was found by the Swedish mathematician Martin-Löf [Martin-Löf, 1966], which we now describe.16

Recall (Section 2) that a set E has measure-zero if there is a sequence of open sets G1, G2, G3, . . . with each Gn covering E and µ(Gn) < 1/n. Recall also that such a sequence of sets G1, G2, G3, . . . is uniformly effective open if there is a single program listing the basic intervals whose unions form these sets (subsection 4.4). Martin-Löf’s fundamental idea was that by simply taking the sequence of covering sets Gn to be uniformly effective open, we get the correct notion of “effective measure-zero”.

A constructive proof of a probabilistic law of randomness (such as the Borel strong law) would usually proceed this way: Given n, one uniformly builds an effective open set Gn of measure less than 1/n such that every sequence in the complement of Gn satisfies the law in question, which immediately establishes that the set of sequences satisfying the law has full measure. The strongest probabilistic law of randomness that we have mentioned, the Law of Iterated Logarithms, is known to have such a constructive proof [van Lambalgen, 1987b, p. 733]. We thus have the following definitions of effective measure-zero and effective full-measure sets:

16 Martin-Löf was visiting Kolmogorov in Moscow during 1964–65, and they were working on randomness and complexity of finite objects (Kolmogorov Complexity). Note also that during the twenty-five year period 1940–65, computability theory (recursion theory) was progressing vigorously and expanding its domain into classical analysis, leading to a highly refined development of the notion of effectiveness, including effective open sets and effective Borel sets by Kleene, Addison, Mostowski, and others. See [Moschovakis, 1980] for more details.
DEFINITION 11 Martin-Löf. A set E ⊆ {0, 1}^N is effective measure-zero iff there is a uniformly effective sequence of open sets, say G1, G2, . . . , such that for all n: (a) E ⊆ Gn, and (b) µ(Gn) < 1/n.17 A set has effective full-measure if its complement is effective measure-zero.

For example, the set of all sequences with bit value 0 at every third position is effective measure-zero. Another example of an effective measure-zero set is the set of all sequences in which the bit pattern 0110110 does not occur. One can think of the sets G1, G2, G3, . . . as providing a uniformly effective sequence of statistical tests for randomness with stronger and stronger significance (a concrete sketch of such a test for the first example is given at the end of this subsection). Finally, we define Martin-Löf Randomness.

DEFINITION 12 Martin-Löf. A sequence x ∈ {0, 1}^N is Martin-Löf Random iff x does not belong to any effective measure-zero set, i.e., iff x belongs to every effective full-measure set.

We can think of an effective probabilistic randomness law L as simply an effective full-measure set L, and think of a sequence x as satisfying the law L iff x ∈ L. We can then restate the definition of Martin-Löf randomness as: x is Martin-Löf Random iff it satisfies all effective probabilistic randomness laws.

Martin-Löf also established the remarkable fact that the set of all Martin-Löf random sequences itself has effective full-measure, that is, the set U of non-Martin-Löf-random sequences is effective measure-zero. This means that there is a sequence of uniformly effective open sets U1, U2, U3, . . . such that µ(Un) < 1/n and U ⊆ ⋂_n Un. But also U ⊇ ⋂_n Un by definition. Hence U = ⋂_n Un, and thus the sequence ⟨Un⟩ acts as a universal test for Martin-Löf randomness: x is Martin-Löf random iff x ∉ Un for some n. This universal test condition gives an especially simple characterization of Martin-Löf randomness.
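As promised above, here is a sketch (Python; all names are ours) of a uniformly effective test witnessing that the first example, the set of sequences with bit value 0 at every third position, is effective measure-zero. The n-th level Gn is the union of the basic intervals N(σ) over strings σ of length 3n having 0 at positions 3, 6, . . . , 3n, so µ(Gn) = 2^{−n} < 1/n.

```python
from itertools import product

def test_level(n):
    """List the basic intervals N(sigma) whose union is G_n: all sigma of
    length 3n fixed to '0' at (1-based) positions 3, 6, ..., 3n. There are
    2^{2n} such strings, so mu(G_n) = 2^{2n} / 2^{3n} = 2^{-n}."""
    intervals = []
    for free_bits in product("01", repeat=2 * n):
        bits = iter(free_bits)
        sigma = "".join("0" if (i + 1) % 3 == 0 else next(bits)
                        for i in range(3 * n))
        intervals.append(sigma)
    return intervals

print(len(test_level(2)), test_level(2)[:4])   # 16 intervals of length 6

# A single generator listing pairs (n, sigma) makes the levels uniformly
# effective open, as Definition 7 requires:
def uniform_test():
    n = 1
    while True:
        for sigma in test_level(n):
            yield (n, sigma)
        n += 1
```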
6.2 Solovay’s characterization of randomness
The Borel-Cantelli Lemma of probability theory implies that if G1, G2, . . . , Gn, . . . is an infinite sequence of events and the sum of their probabilities converges (as an infinite series), then with probability one, only finitely many of these events can occur.

17 Of course, instead of the measure bounds 1/n for Gn, one can use any sequence of positive rational numbers εn, so long as εn can be effectively computed from n and the sequence ⟨εn⟩ converges to zero. It is not hard to see that if a computable sequence of positive rationals ⟨rn⟩ converges to zero, then given any other computable sequence of positive rationals ⟨sn⟩ → 0, there is a computable subsequence ⟨r_{n_k}⟩ of the original sequence with r_{n_k} < s_k for all k. Therefore, the choice of the bounding sequence is completely arbitrary, so long as it forms a computable sequence of positive rationals converging to zero.
The following remarkable result of Solovay shows that this “Borel-Cantelli condition” characterizes Martin-Löf randomness, provided that we restrict the sequence of open sets to be uniformly effective.

THEOREM 13 Solovay. An infinite binary sequence x is Martin-Löf random iff for every uniformly effective sequence G1, G2, . . . , Gn, . . . of open sets,

∑_{n=1}^{∞} µ(Gn) < ∞  =⇒  x belongs to only finitely many of the sets Gn.

See Chaitin [1992] for a proof. The above condition (in the theorem) for characterizing Martin-Löf randomness is known as Solovay Randomness. Note that in Solovay’s characterization, the infinite series ∑_n µ(Gn) is simply assumed to converge; there is no need to assume that it converges in any effective way.
6.3 More general probability spaces
Martin-Löf’s and Solovay’s definitions of randomness are so general and flexible that they can be applied to any effective separable complete metric space with an effective probability measure. This includes a very large class of probability spaces, including many (perhaps most) probability spaces arising in practice. Thus the Martin-Löf definition provides a way of assigning a precise meaning to the word “random” in quite general settings. See [González, 2008; Hoyrup, 2008] for more on this.
7 RANDOMNESS OF FINITE STRINGS: KOLMOGOROV COMPLEXITY
We now turn to the “Laplacian Problem” mentioned in the introduction, namely, that of defining randomness for finite sequences. Laplace’s observation was that among all the sequences of a fixed large size, only a few “regular” ones have a “rule that is easy to grasp”, and Laplace attributes this to those sequences having “a regular cause”. The other sequences, “incomparably more numerous”, are irregular, and we therefore take them to be the random ones. If we follow this “Laplacian Program”, then our problem reduces to precisely isolating the notion of a sequence having a “regular cause” behind it (or being generated by a “rule that is easy to grasp”). This is precisely the philosophical point missing from classical probability theory, which fails to distinguish the strings which we think of as being “regular”. This problem was resolved quite satisfactorily in the mid 1960s by Solomonoff, Kolmogorov, and Chaitin. Their theory provided a measure for the information content or the complexity of a binary string (or more generally of a finite object which can be represented by a binary string) by taking it to be the length of the
“shortest possible complete description” of the string, or its description-complexity. The idea is based on our intuition that a relatively simple object will have a short complete description, while a highly complex one will lack a short description which can completely specify it. Moreover, the related notion of algorithmic probability invented by Solomonoff assigns a form of a priori universal probability to binary strings. But unlike classical probability, it takes into account the information-complexity of the string when assigning the probability. As a result, its probability assignments sharply discriminate between the regular strings and the random ones, and provide an explanation of Laplace’s intuition as to why the regular strings are more likely to have a “cause”.

We have been using the term “description” freely, without much qualification. It is important to precisely specify what is meant by a “description”. Unrestricted use of the term, as done in natural languages, causes problems, as shown by the following.
7.1 The Berry paradox
Consider the definition: The Berry number is the smallest positive integer that cannot be described in less than eighteen words. Since only a finite number of positive integers can be described using less than eighteen words, the Berry number is well-defined, and by definition it cannot be described in less than eighteen words. Yet the above definition describes it using only seventeen words. This is the Berry paradox.

The problem here is with the use of the property of “a number being described in certain words”, which is not precisely defined, and cannot be defined, as used in the definition of the Berry number, without being circular. Whatever its resolution, the Berry paradox reminds us that we have to be careful when talking about the “description” of a number or a string. Once again, the concept of algorithm or effective computation allows us to make this precise: We restrict only to “algorithmic descriptions” or “effective descriptions” as defined below. Formally, by an algorithm or program we mean a Post machine program.

DEFINITION 14 Algorithmic Description. Let P be a program and σ be a binary string. We say that a string δ is a P-description of σ, or that δ P-describes σ, if the program P on the input δ halts with output σ.

The idea here is that whenever the program P halts on the input δ with output σ, we think of the string δ as being an algorithmic description of the string σ, according to the algorithm P. The computation of σ from input δ by P is thought of as P reconstructing σ from its description δ. This is equivalently called the decompression of δ by P (into σ). The description δ is intended to be shorter
than the object being described (the string σ), and therefore it can be viewed as a “compressed” version of the string σ. Given a program P and a string σ, we now look for the shortest string(s) P-describing σ, and take the length of such string(s) as a complexity-measure of the string σ with respect to P. Of course, for certain programs P, a string σ may not have any P-description, in which case the complexity of σ (with respect to P) is considered to be infinite.

DEFINITION 15 Algorithmic P-Complexity. The plain algorithmic complexity of a string σ with respect to the program P, or simply the P-complexity of σ, denoted by CP(σ), is the length of the shortest string(s) P-describing σ, provided that there are such strings. If there is no string which P-describes σ, we let CP(σ) = ∞.
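CP is not computable in general (deciding whether P halts on a given δ is the Halting Problem), but a bounded search illustrates the definition. The sketch below (Python, reusing the hypothetical run_post_machine, unary_input, read_output, num, and str_ helpers from the earlier sketches) finds the length of the shortest description among those that halt within a step bound; this is an upper bound for CP(σ), and equals it when the search happens to be exhaustive enough.

```python
from itertools import product

def run_on_string(P, delta, max_steps):
    """Run program P on the input string delta (via the num/str translation)
    and return the output string, or None if P is not seen to halt."""
    try:
        tape, head = run_post_machine(P, unary_input(num(delta)),
                                      max_steps=max_steps)
    except RuntimeError:
        return None
    return str_(read_output(tape, head))

def bounded_P_complexity(P, sigma, max_len=12, max_steps=10**5):
    """Length of the shortest delta with |delta| <= max_len that P-describes
    sigma within max_steps; None if none is found in this search space."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            if run_on_string(P, "".join(bits), max_steps) == sigma:
                return length          # an upper bound for C_P(sigma)
    return None
```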
There are programs P such that the P-complexity CP(σ) is a finite number for all strings σ (i.e., P has the property that ∀σ∃δ(δ P-describes σ)). Let us call such programs P complexity-finite. An example of a complexity-finite program is the empty program E (the program with no instructions), which computes the identity function: Every string E-describes itself.

If CP(σ) is much smaller than the length of σ, we think of σ as being “well-compressed” by P, since there are P-descriptions of σ much shorter than σ. The ratio |σ|/CP(σ) is the “compression factor” for the string σ, with respect to P. For the empty program E, CE(σ) = |σ| for all σ, and so the compression factor is 1 for all strings, meaning no string is really “compressed” by E. On the other hand, there are complexity-finite programs P which compress infinitely many strings by arbitrarily large factors.

Example. Let P be the program informally described as follows. If the first symbol of the input string σ is 0, then P erases this leading 0, shortening its length by 1, outputs the resulting string, and halts; else P outputs the string consisting of 2^{|σ|} 1s and halts. Then CP(σ) ≤ |σ| + 1 for all σ, since 0σ is a P-description of σ, so P is complexity-finite. But for any n, if we put δn = 1^n and σn = 1^m where m = 2^n, then δn is a P-description of σn, so CP(σn) ≤ |δn| = n, and the compression factor for σn is |σn|/CP(σn) ≥ 2^n/n. Given any number a, we can find n such that 2^n/n > a, and so the strings σn, σn+1, . . . are all compressed by a factor of more than a. In particular, the string σ10 (the string of 1024 1s) is compressed by P to the string 1111111111, thus by a factor of more than 100. More drastically, the string σ64 is compressed by a factor of 288230376151711744.

However, a simple but important counting argument (an example of the so-called pigeon-hole principle) shows that no method can compress strings too uniformly. Given an arbitrary partial function f : {0, 1}* → {0, 1}*, we think of f as a general method of string-description, and think of δ as an f-description of σ if f(δ) = σ. (The P-descriptions given by programs P are a special “algorithmic” case of this.) We say that a string σ is compressed by f if there is an f-description of σ which is shorter than σ. More generally, given a positive integer b, we say that
σ is b-compressed by f if there is an f-description of σ which is shorter than σ by at least b bits (i.e., if ∃δ(f(δ) = σ ∧ |δ| ≤ |σ| − b)). Thus σ is compressed by f iff σ is 1-compressed, that is, compressed by at least 1 bit.

THEOREM 16 “Only a small minority of strings compress”. For any string-description method f, less than half of all the strings of length ≤ k can be compressed (i.e., 1-compressed) by f. More generally, less than a fraction of 1/2^b of all the strings of length ≤ k can be b-compressed by f.

Proof. For any k and b, let A be the set of strings of length ≤ k, B be the set of strings of length ≤ k − b, and C be the subset of A consisting of those members which can be b-compressed by f. Note that there is a one-to-one correspondence between C and a subset of B (for each σ ∈ C, fix δσ ∈ B with f(δσ) = σ; then the correspondence σ ↔ δσ is one-to-one). Also note that |A| (the number of members of A) equals 2^{k+1} − 1 and |B| = 2^{k−b+1} − 1, hence the fraction of those strings in A which are b-compressed equals

|C|/|A| ≤ |B|/|A| = (2^{k−b+1} − 1)/(2^{k+1} − 1) < 2^{k−b+1}/2^{k+1} = 1/2^b.

For example, among the strings of length not exceeding a thousand (or any other number) bits, more than 99.9% will either not compress at all or compress by at most 9 bits, whatever the method of string description.

We are of course interested in the shortest possible descriptions, or the best possible compression factor. Therefore, among the complexity-finite programs, we prefer the ones which tend to give overall shorter descriptions (better compression) for strings. In other words, given complexity-finite programs P and Q, we could regard P as “better” than Q if the P-complexity CP(σ) is lower than the Q-complexity CQ(σ) for all strings σ. We can then try to choose a “best” complexity-measure and use it as a standard. Unfortunately, it is impossible to get such a “best” complexity-finite program in the uniformly strict sense above, because for every complexity-finite program one can find another one which lowers complexity (compresses better) for an arbitrarily large number of strings by an arbitrarily large amount.18 Therefore, we will compare programs in a “general overall sense” rather than the uniformly strict sense above, relaxing the relation of one program being better than another as follows. For programs P and Q, let us define:

P matches Q (in terms of complexity) ⇐⇒ ∃k∀σ(CP(σ) ≤ CQ(σ) + k),

i.e., from a complexity viewpoint, the program P is regarded as being as good as the program Q or better (“P matches Q”) if the P-complexity of every string is less than its Q-complexity modulo some constant independent of the string. Let us also call programs P and Q complexity-equivalent if each one matches the other, i.e., if the difference between P-complexity and Q-complexity is bounded uniformly by some constant. It now turns out that in this sense, there is indeed a “best compressing” or optimal program, which is also unique in the sense that any other optimal program is complexity-equivalent to it.

DEFINITION 17 Optimal or Universal Programs. A program is called universal, or complexity-optimal, or simply optimal if it matches every program.

THEOREM 18 Solomonoff-Kolmogorov-Chaitin Invariance Theorem. There is an optimal program U. Moreover, every optimal program is complexity-equivalent to U.

Proof. A program U is defined as follows. Given an input string σ, the program U finds the length m of the longest prefix of σ consisting of 0s (m = 0 if σ begins with 1 or is empty), and erases this initial run of 0s in σ. If the remaining string begins with a 1, that symbol is also erased, with the final result being the string ρ. U then runs the program with Gödel number m on the input ρ. To show that U matches P for any P, let e = e(P) be the Gödel number of P and let kP = e + 1. For any string σ, let δ be a shortest P-description of σ, so that CP(σ) = |δ|. Put τ = 0^e 1δ; then τ is a U-description of σ, so CU(σ) ≤ |τ| = e + 1 + |δ| = CP(σ) + kP, where kP = e + 1 is independent of σ.

18 More precisely, for every complexity-finite P and every m and n, there is another complexity-finite program Q such that CQ(σ) < CP(σ) − n for at least m strings σ. Proof: Without loss of generality assume that m = 2^k − 1 for some k, and fix distinct σ1, . . . , σm such that CP(σj) > n + k for j = 1, . . . , m. This can be done since there are only finitely many strings σ with CP(σ) ≤ n + k. Also let α1, . . . , αm be a non-repeating listing of all strings of length less than k. Now the program Q can be so designed that it contains coded copies of σ1, . . . , σm inside it, and behaves in the following way: If the input is αj (1 ≤ j ≤ m), then output the string σj and halt; otherwise, truncate the input string by removing its first k symbols and emulate the program P with this truncated input.
We now fix a particular optimal U and define the plain algorithmic complexity of a string σ, C(σ), to be CU (σ). DEFINITION 19 Universal Plain Algorithmic Complexity C. Define the plain algorithmic complexity function C by C(σ) = CU (σ), where U is an optimal program fixed permanently. Being optimal, U matches E, where E is the empty program with E-complexity CE (σ) = |σ|. So:
COROLLARY 20. There is k such that for all σ, C(σ) ≤ |σ| + k.
The plain algorithmic complexity C(σ) of a string σ is also known as Kolmogorov complexity. The term “Kolmogorov complexity” is used in a wide and general sense as a synonym for algorithmic complexity, and so prefix-free complexity (described next) is also known as Kolmogorov complexity.
The invariance theorem is quite remarkable, as it shows that the concept of plain algorithmic complexity is essentially unique and independent of the particular model of computation being used. The plain algorithmic complexity C(σ) of a string σ can usefully be viewed as a measure of the (algorithmic) information content of the string σ. We therefore have a formal definition for the somewhat vague notion of information contained in a finite object.19

DEFINITION 21 Randomness and Compressibility for Finite Strings. Let σ be a string and b be a positive integer. We then define: (a) σ is b-compressible if C(σ) ≤ |σ| − b; (b) σ is compressible if it is 1-compressible, i.e., if C(σ) < |σ|; and (c) σ is random if σ is not compressible, i.e., if C(σ) ≥ |σ|.

THEOREM 22 Existence of Random Strings. For every n there are random strings of length n. More generally, for any n and b > 0, at least 2^n − 2^{n−b+1} + 1 strings of length n are b-incompressible.

Proof. Fix n, and for each σ with |σ| = n, pick a string δσ of smallest length describing σ. The strings δσ are distinct for distinct σ, and there are fewer than 2^n strings of length less than n, hence by the pigeon-hole principle |δσ| ≥ n for some σ of length n, so that C(σ) ≥ n, and so σ is a random string of length n. The second statement is proved similarly.

For finite strings, note that we really have relative degrees of randomness. If we have two thousand-bit strings, one of which is not compressible and the other, say, 2-compressible but not 3-compressible, then the first one is more random than the second one, but only slightly more so. The complexity measure C(σ) therefore provides a measure for the degree of randomness in σ: The smaller the value of C(σ) compared to |σ|, the less random it is. Among all binary strings of a fixed length, the most random are the ones on which the function C achieves its maximum value, and the most non-random ones are the ones on which C is minimized.

Also, short finite strings (such as 111) can be random yet quite simple. This is not surprising when we regard the strings as being produced by a fixed number of flips of a fair coin: If a series of three flips produces 111, there is no reason to suspect the randomness of the process, and so it is easy to also accept the outcome 111 as random. However, a long string of all 1s (say a million bits, all

19 Information content as defined by algorithmic complexity should be contrasted with the one known as Shannon entropy in “classical” information theory, where it is defined in probabilistic terms for random variables. While Shannon’s theory of information focuses on an entire set of strings associated with varying probabilities, the Kolmogorov theory focuses on an individual string. However, the two notions are closely related; see [Li and Vitanyi, 2008, p.603–608].
1s) is dramatically non-random and will cause us to question the randomness of the process generating it.

THEOREM 23. The complexity function C is not computable.

Proof. (The proof resembles the argument of the Berry paradox.) If C were computable, then one could define a program P which, given any string σ as input, computes a string σ* such that C(σ*) > 2|σ|. By the invariance theorem, there is a constant k such that C(σ) ≤ CP(σ) + k for all σ. Let σ = 1^{k+1}. Then C(σ*) > 2|σ|, but since σ is a P-description of σ*, we get C(σ*) ≤ CP(σ*) + k ≤ |σ| + k < 2|σ|, a contradiction.

The plain complexity function C also satisfies much of the intuitive concepts relating to the “information content of a finite object.” For example, if σ is a long string with significant information content, the information content will not double in the string σσ, because of redundancy of information: If σ can be described, so can σσ, with little additional verbiage. To put it differently, every program can be modified by adding only a few lines where the final output is duplicated by “post-processing”. We therefore have:

THEOREM 24. For some k, C(σσ) ≤ C(σ) + k for all σ.
Random finite strings also satisfy a very general “stochastic” property. Let R be a property of binary strings, that is, R ⊆ {0, 1}*. We say that a binary string σ satisfies the property R if σ ∈ R, and we say that almost all strings satisfy the property R if the fraction of the strings of length n which satisfy R, |{σ ∈ R : |σ| = n}|/2^n, approaches 1 as n approaches ∞. For a proof of the following result, see [Sipser, 1997, p.219].

THEOREM 25 General Stochasticity for Finite Random Strings. If almost all strings satisfy a computable property R, then all except a finite number of random strings satisfy R. The result also holds for b-incompressible strings (for any b > 0).

More properties of general algorithmic complexity will be stated in the next section using a variant of the plain complexity function C that we just described. The new complexity function K will be obtained essentially by restricting the class of strings that are allowed to be algorithmic descriptions: Descriptions must now have a particular form of “unique readability” under some rule. In either case, we see that by using the concept of algorithm to formalize the fundamental idea of incompressibility or lack of shorter descriptions, we arrive at a precise and invariant definition of randomness for finite strings that remarkably captures our intuition as described by Laplace. See subsection 8.2 for more discussion on this topic. We also note that while we used Post-Turing computability (Post Machine programs) as the model of computation, any other model of computation could be used satisfactorily. A more recent approach is to use Binary Lambda Calculus, due to John Tromp [2009], to study Kolmogorov Complexity.
The literature of modern research in the subject of Kolmogorov Complexity is vast; see [Li and Vitanyi, 2008]. We end this section with a slight digression by giving a specific example of an application of algorithmic complexity.
7.2 An application to Gödel incompleteness
While most strings are random, only a finite number of them can be proved to be so. For each string σ, let nσ denote its complexity, i.e., nσ = C(σ). In any formalization of mathematics (say ZFC), the relation “the complexity of σ is n” can be formally expressed. Using a variant of the argument of the Berry paradox and the fact that the set of theorems is computably enumerable, we now show that while the sentences “the complexity of σ is nσ ”,
σ = Λ, 0, 1, 00, 01, 10, 11, 000, 001, . . .
are all true, only finitely many of them can be proved in the theory. We therefore have an “information theoretic version” of Gödel’s Incompleteness Theorem. It also follows that there is a “maximum provable complexity”: For some m, no string can be proved to have complexity more than m. [Boolos and Jeffrey, 1989; Davis, 1980]

THEOREM 26 Gödel’s Incompleteness Theorem, Information-Complexity Version. Only finitely many of the sentences of the form “C(σ) = n”, where σ ranges over binary strings and n over natural numbers, are theorems of mathematics.

Proof. Let C(x, y) be the formula expressing “the complexity of x is y” in the formal theory,20 and also, for each string σ and natural number n, let ⌜σ⌝ and ⌜n⌝ be their formal names in the theory. Since the set of theorems can be computably enumerated, there is a program (Post machine) P which, on input string δ, searches through all the theorems to check if any of them is of the form C(⌜σ⌝, ⌜n⌝) with n > 2|δ|, and if one such theorem is found, outputs the string σ and halts. Using the invariance theorem, fix k such that C(σ) ≤ CP(σ) + k for all σ. Now run the program P on the input string δk = 1^k. Suppose, for contradiction, that infinitely many sentences of the form C(⌜σ⌝, ⌜n⌝) are theorems. Since only finitely many strings have complexity ≤ 2k, some such theorem has n > 2k, so P eventually finds one, with n = n0 > 2k, and halts with an output string σ = σ0. Since C(⌜σ0⌝, ⌜n0⌝) is true, C(σ0) = n0. But C(σ0) ≤ CP(σ0) + k ≤ |δk| + k = 2k < n0, a contradiction.

20 To do this, recall our fixed optimal program U, and let a relation HU be defined as: HU(x, n, y) ⇐⇒ U halts with output x in less than n program execution steps on some input string of length not exceeding y. Since HU is computable, there is a formula ψ(x, n, y) such that ψ(x, n, y) is provable if HU(x, n, y) is true, and otherwise its negation ¬ψ(x, n, y) is provable. Now let C(x, y) be ∃n(ψ(x, n, y)) ∧ ¬∃z < y(∃n(ψ(x, n, z))).
equal to the complexity of σ, i.e. nσ = C(σ)), it is also easy to see that the true statement “the complexity of σ is ≤ nσ ” can be proved for every string σ. In other words, while we cannot prove that the complexity of σ equals nσ (except for finitely many strings σ), we can prove, for every string σ, that the complexity of σ does not exceed nσ (its true value), without being able to recognize that nσ is indeed the true value for the complexity of σ. For a critical discussion of the information-complexity version of G¨odel incompleteness, see [van Lambalgen, 1989].
8 THE PREFIX-FREE COMPLEXITY K
While the plain complexity measure C(σ) yields a quite satisfactory theory of randomness for finite strings, the function C has some defects. One such defect is about how it relates to the randomness of infinite sequences. If we look at the initial segments σn = ⟨x1, x2, . . . , xn⟩ of an infinite sequence x, the complexity of the n-th initial segment C(σn) drops by an undesirable amount for infinitely many n, a phenomenon known as complexity oscillation [Li and Vitanyi, 2008, p.143]. Also, for the needs of Solomonoff’s theory of Algorithmic Probability, using the literal value of the plain complexity measure was not the correct formulation. Several ideas were developed for dealing variously with such problems, such as monotone complexity, process complexity, decision complexity, and uniform complexity, but it is prefix-free complexity, also called prefix complexity for short, due to Levin [Kolmogorov, 1974], Gács [1974] and Chaitin (see [Chaitin, 1992] for references), that has now become the standard for algorithmic (Kolmogorov) complexity, and is denoted by K. It is very much like C, but with a more restricted definition of “description”, where only certain classes of strings having a form of unique readability are allowed to be descriptions.

Suppose that we want to pass a string σ directly to a Post machine program as input by placing the string on the tape (all other bits being zero) and starting the program with its head at the beginning bit of the string. Unfortunately, since the tape consists only of 0s and 1s and no special termination markers, there is no general way for any program to determine the end of the string. For example, if the head is started on a single 1 with all other tape bits being 0, how can the machine be sure that there is not another 1 after a trillion bits? Even simpler, how can the machine know if this is supposed to represent the string 1, or 10, or 10000? We express this problem by saying that the plain representation of a binary string is not properly delimited. We circumvented this problem by converting a string σ first to its number code num(σ) = n (say), and then passing the unary coded version of n, namely 1^{n+1}0, to the machine. This coding is an example of a prefix-free or self-delimiting code, where no code string is a proper initial segment of another, and the input to the machine is uniquely determined. More generally, a set S of strings is said to be prefix-free if no string in S is a prefix of another member of S. The unary codes 1^{n+1}0, n = 0, 1, 2, . . . , do form a prefix-free
set, but it is exponentially more inefficient than passing the plain binary string. The following example describes a more efficient prefix-free coding.

Example 1 (The 1-code). Consider the scheme where each binary string σ is coded by 1^{|σ|}0σ. The string 1^{|σ|}0σ will be called the 1-code of σ. For example, the 14-bit string τ = 00011101011100 has the following 1-code (the unary part 1^{|τ|}0 followed by τ):

    111111111111110 00011101011100
The set of 1-codes forms a prefix-free set. In fact, by placing the head at the beginning of the 1-code of σ, it can be uniquely decoded back to σ (by first decoding the initial unary part 1^{|σ|}0). It takes 2|σ| + 1 bits to encode the plain binary string σ into its 1-code (e.g., in the example displayed above |τ| = 14, so the 1-coded string has length 29).

Example 2 (The 2-code). An even more efficient prefix-free coding is obtained by the following scheme. Given a binary string σ, first express the length |σ| of the string in plain binary notation bin(|σ|), and then prefix σ with the 1-code of bin(|σ|) to get what we will call the 2-code of σ. For example, the string τ = 00011101011100 of the previous example has length 14, which in binary notation is 1110. Since 1110 has length 4, its 1-code is 111101110, and we prefix this to τ to get the 2-code of τ (the parts 1^{|bin(|τ|)|}0, bin(|τ|), and τ):

    11110 1110 00011101011100
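To make the two encodings concrete, here is a minimal Python sketch (our own illustration; the helper names encode1, encode2 and decode1 are not from the text):

```python
def encode1(s):
    # 1-code: |s| ones, then a 0, then s itself
    return "1" * len(s) + "0" + s

def encode2(s):
    # 2-code: the 1-code of bin(|s|), then s itself
    return encode1(format(len(s), "b")) + s

def decode1(code):
    # the unary prefix tells us exactly where the encoded string ends
    n = code.index("0")
    return code[n + 1 : n + 1 + n]

tau = "00011101011100"
assert len(encode1(tau)) == 2 * len(tau) + 1   # 29 bits, as in Example 1
assert len(encode2(tau)) == 23                 # 23 bits, as in Example 2
assert decode1(encode1(tau)) == tau
```

Prefix-freeness is exactly what lets decode1 locate the end of the string without any termination marker.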
The 2-code gives a prefix-free encoding which encodes the string σ using |σ| + 2 log₂|σ| + 1 bits. For example, the 2-code of τ shown above consists of 23 bits, a saving of 6 bits over its 1-code. Further improvements can be made by iterating this method. Note that all these encodings are effective, i.e., there are simple algorithms for decoding and encoding strings according to any of these schemes. Moreover, the above examples are length-monotonic, meaning that longer strings have longer codes. A real-world example of a prefix-free set (over the alphabet of decimal digits) is the set of country dialing codes in the international telephone system.

We will now define prefix-free complexity, which is quite similar to plain complexity and is defined as the length of the “shortest description”. The main difference is that while for plain complexity any binary string could possibly count as a “description”, for prefix-free complexity only “prefix-free strings” (under an effective prefix-free encoding) are allowed to be descriptions.

DEFINITION 27 Prefix-free Complexity Functions. A partial function ψ : {0,1}* → {0,1}* mapping strings to strings is prefix-free if its domain is prefix-free. Given
a partial computable prefix-free function ψ, the prefix-free ψ-complexity function K_ψ is defined by letting K_ψ(σ) be the length of the shortest string(s) δ for which ψ(δ) = σ, and putting K_ψ(σ) = ∞ if no such string exists.
An example of a prefix-free complexity function is obtained by coupling the decoding function for the 2-codes with the optimal program U for plain complexity C, as follows. Let D_2 denote the set of all 2-coded strings, so that D_2 is a prefix-free set. The decoding function d_2 : D_2 → {0,1}* for decoding 2-codes establishes a one-to-one correspondence between D_2 and {0,1}*, but we regard it as a partial function from {0,1}* to {0,1}*. If a string δ is not in D_2, then d_2(δ) is not defined, so that the domain of d_2 is D_2. Now define ψ_2 to be the function which, given an input string δ, regards δ as the 2-code of some string, decodes it into d_2(δ), and sends this decoded string to the program U as input. U, in turn, runs with d_2(δ) as input, and may halt with an output string σ, in which case we put ψ_2(δ) = σ. If δ is not in D_2 so that d_2(δ) is not defined, or if U does not halt on input d_2(δ), then we leave ψ_2(δ) undefined. Clearly ψ_2 is partial computable and its domain is a subset of D_2, hence it is a partial computable prefix-free function. Let the corresponding prefix-free complexity function K_{ψ_2} be denoted simply by K_2. We thus have an example of a prefix-free complexity function K_2.

How does K_2 compare with the plain complexity function C? Given a string σ with plain complexity n = C(σ), let τ be a string of length n for which U outputs σ on input τ. Let δ be the 2-code of τ, so that d_2(δ) = τ, and thus |δ| = |τ| + 2 log₂|τ| + 1 = n + 2 log₂ n + 1. Now, by definition, ψ_2(δ) = σ. Moreover, since encoding-decoding using the 2-code is length-monotonic, there is no string δ′ shorter than δ with ψ_2(δ′) = σ, hence:

    K_2(σ) = |δ| = n + 2 log₂ n + 1 = C(σ) + 2 log₂ C(σ) + 1,

which shows that K_2(σ) exceeds C(σ) by 2 log₂ C(σ) + 1. Of course K_2 is not “optimal”, and there are “better” prefix-free partial computable functions ψ giving lower complexity values. In order to get a “universal optimal” function, we need the following fundamental result.

THEOREM 28 Invariance Theorem for Prefix-free Complexity. There is a partial computable optimal prefix-free function ξ. That is, for each partial computable prefix-free function ψ there is a constant k satisfying:

    K_ξ(σ) ≤ K_ψ(σ) + k,
for all σ.
DEFINITION 29 Universal Prefix-free Complexity. The Prefix-free Algorithmic Complexity K(σ) is defined by taking K(σ) = Kξ (σ), where ξ is a permanently fixed optimal partial computable prefix-free function. How does the prefix-free complexity value K(σ) compare with the plain complexity value C(σ)?
Roughly speaking, since fewer strings are allowed to be descriptions of σ under prefix-free complexity K, we may expect a somewhat higher value for the prefix-free complexity K(σ) than C(σ) (modulo a constant). This indeed turns out to be true. On the other hand, we can use 2-codes to convert any plain description δ into a prefix-free description of length |δ| + 2 log₂|δ| + 1. More precisely, for the specific prefix-free complexity function K_2 in the example above, we saw that K_2(σ) = C(σ) + 2 log₂ C(σ) + 1. For the optimal prefix-free complexity K, we get an overall lower value for K(σ) than K_2(σ) (modulo a constant), and so C(σ) + 2 log₂ C(σ) + 1 is only an upper bound for K(σ). The following result gives a standard upper bound for K(σ).

THEOREM 30. Modulo additive constants, C(σ) ≤ K(σ) ≤ C(σ) + 2 log₂|σ|.

Proof. The first inequality follows from the invariance theorem for C (optimality of C), since the partial computable function ξ used to define K is just one specific “program”. To prove the second inequality, note that we had already established that K_2(σ) ≤ C(σ) + 2 log₂ C(σ) and C(σ) ≤ |σ|, modulo constants. Combining the two, the result follows.

Of course, using more efficient prefix-free encodings, this result can be further sharpened. K has many nice properties which are lacking in C. From now on, we will be using the prefix-free complexity function K in place of C. In particular, the definitions of compressibility and randomness for finite strings are restated as follows.

DEFINITION 31 Randomness and Compressibility for Finite Strings. Let σ be a string and b a positive integer. We then define:

(a) σ is b-compressible if K(σ) < |σ| − b; otherwise it is b-incompressible;

(b) σ is compressible if it is 1-compressible, i.e., if K(σ) < |σ|; otherwise it is called incompressible; and

(c) σ is random if σ is incompressible, i.e., if K(σ) ≥ |σ|.

Since K assigns a higher complexity value to strings than C (uniformly modulo a constant), strings compress a “little less” under K than under C, and so random strings now become “more numerous”. In particular, the existence theorem for random strings (which holds for any complexity measure) remains valid.
THEOREM 32 Existence of Random Strings. For every n there are random strings of length n. More generally, for any n and b > 0, at least 2^n − 2^{n−b+1} + 1 strings of length n are b-incompressible.
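The counting behind this is short; the following is our reconstruction of the standard argument. A string σ of length n with K(σ) < n − b has a description of length at most n − b − 1, and distinct strings require distinct minimal descriptions, so the number of b-compressible strings of length n is at most

    ∑_{i=0}^{n−b−1} 2^i = 2^{n−b} − 1 ≤ 2^{n−b+1} − 1,

leaving at least 2^n − 2^{n−b+1} + 1 of the 2^n strings of length n b-incompressible.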
8.1 Properties of finite random strings
We mention two more “stochastic” properties of finite random strings which conform to our intuition, as evidence that incompressibility is the correct definition of randomness for finite strings.

THEOREM 33. Long random strings have a “balanced” number of 0s and 1s. More precisely, for any ε > 0 there is k such that for all random strings σ of length n > k, we have

    |S_n[σ]/n − 1/2| < ε,

where S_n[σ] denotes the number of 1s in σ.

THEOREM 34. Any run of zeros or ones in a random string of length n is asymptotically bounded above by O(log n). That is, there are k and a constant a such that for every random string σ of length n > k, the longest run of 0s (or of 1s) in σ is less than a log n.
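A quick empirical illustration of Theorem 34 (our own sketch): for typical coin-flip strings, which are overwhelmingly incompressible, the longest run indeed grows roughly like log₂ n.

```python
import math
import random
from itertools import groupby

def longest_run(bits):
    # length of the longest block of consecutive equal bits
    return max(len(list(group)) for _, group in groupby(bits))

for n in (10**3, 10**4, 10**5):
    s = [random.randint(0, 1) for _ in range(n)]
    print(n, longest_run(s), round(math.log2(n), 1))
```

Each printed longest run should be within a small constant factor of log₂ n.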
8.2 Kolmogorov Complexity as vindication of Laplace
We end our discussion of randomness for finite strings with the position that Kolmogorov Complexity provides a satisfactory solution to Problem 2 of the Introduction. As outlined and evidenced above, the incompressibility definition of randomness for finite strings conforms quite well to our intuition. In fact, an important practical test of randomness for finite strings is to apply standard computational compression programs to the string in question and check whether it compresses or not. Kolmogorov Complexity also provides a strong vindication of all of Laplace's intuitions, by classifying strings according to their complexity: the lower the K value, the more “regular” the strings, and the higher the K value, the more “irregular” (or random) they are.

Cause and Regularity. Laplace mentions that we perceive a “cause” in strings which are “regular”, “those in which we observe a rule that is easy to grasp.” Kolmogorov complexity provides a precise and objective way to define this idea, using short effective descriptions. Given a string σ with small complexity value K(σ), its “cause” (or a “rule that is easy to grasp”) is any of its minimal descriptions, i.e. any minimal-length string (of length K(σ)) which describes σ via the optimal algorithm. If K(σ) is not small compared to |σ|, we regard the string σ as “irregular” or random. This interpretation of cause and regularity is obtained by classifying strings by their complexity, i.e., by measuring how short an effective description is possible. The invariance of
Kolmogorov complexity under all possible methods of effective description (for sufficiently large strings) shows that this is not an arbitrary measure of complexity, but an essentially objective one.

Rarity of Regularity. Laplace mentions that the “irregular sequences . . . are incomparably more numerous” compared to the “regular” ones. This is again confirmed by the fact that if cause or regularity is defined via sufficiently short descriptions, then the irregular sequences automatically become “incomparably more numerous”. We saw this in the theorem which showed that only a small minority of strings compress well.

Probability of Regular Strings. Laplace explains that while the regular strings are much less numerous, if we observe a highly regular but long string, “we seek a cause whenever we perceive symmetry”, and it is “more probable” that “this event ought to be the effect of a regular cause” than “that of chance”. For example, let σ be the thousand-bit string containing the pattern 01010101. . . throughout. If we observe σ, our intuition tells us that, in a sense not explained by classical probability theory, it is remarkably different from another random string generated by a thousand coin flips. Classical probability, being information-blind, will assign the same probability 1/2^{|σ|} = 1/2^{1000} to σ as to every other thousand-bit string. But the algorithmic probability (using Kolmogorov complexity) of σ is 1/2^{K(σ)}. If in addition to random coin flips we also consider “effective causes”, then algorithmic probability remarkably explains Laplace's intuition. As the string σ is regular (has a short description), K(σ) will be much smaller than |σ| = 1000, and so its algorithmic probability 1/2^{K(σ)} will be much higher than that of a random string of the same length, since the algorithmic probability of a random string is, by definition of randomness for finite strings, at most 1/2^{|σ|} = 1/2^{1000}. In other words, the probability that σ was generated by an “effective cause” is higher than the probability that it was generated randomly, by a factor of at least 2^{|σ|−K(σ)} = 2^{1000−K(σ)}. With the conservative estimate K(σ) = 950, this factor is 2^50 = 1125899906842624. At last, the Laplace Program is realized.
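The compression test mentioned above, and the arithmetic of the Laplace factor, are easy to try out. Here is a small Python sketch (ours) using the general-purpose compressor zlib as a crude stand-in for K:

```python
import random
import zlib

def pack_bits(bits):
    # pack a string of '0'/'1' characters into bytes for the compressor
    return int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")

regular = "01" * 500                                         # Laplace's patterned string
typical = "".join(random.choice("01") for _ in range(1000))  # a coin-flip string

for name, s in (("regular", regular), ("typical", typical)):
    raw = pack_bits(s)
    print(name, len(raw), "bytes ->", len(zlib.compress(raw, 9)), "bytes")

# the Laplace factor 2^(|sigma| - K(sigma)) with the conservative estimate K = 950:
print(2 ** (1000 - 950))   # 1125899906842624
```

The patterned string compresses drastically, while the coin-flip string does not; this is the practical face of the incompressibility definition.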
9 KOLMOGOROV-CHAITIN RANDOMNESS AND SCHNORR'S THEOREM
We now return to randomness for infinite sequences. We will say that an infinite binary sequence x is b-incompressible if every initial segment ⟨x_1, x_2, . . . , x_n⟩ of x is b-incompressible as a finite string, i.e. if

    K(⟨x_1, x_2, . . . , x_n⟩) > n − b for all n.
We say that the infinite binary sequence x is incompressible if it is b-incompressible for some b, i.e., if no initial segment of x can be compressed by more than a fixed number of bits. This property of incompressibility of an infinite binary sequence x can be regarded as an information-complexity definition of randomness for x.

DEFINITION 35 Kolmogorov-Chaitin Randomness. An infinite binary sequence x is Kolmogorov-Chaitin random if x is incompressible (no initial segment of x can be compressed by more than a fixed number of bits), or in other words, if for some b:

    K(⟨x_1, x_2, . . . , x_n⟩) > n − b for all n.
Remark: The term “Kolmogorov-Chaitin random” is not in standard use. In the literature it is known variously as “Chaitin random”, “Levin-Chaitin random”, “Levin-Chaitin-Schnorr random”, “K-incompressible”, etc.

The definition of Kolmogorov-Chaitin randomness appears to be significantly different from the definition of Martin-Löf randomness (or Solovay randomness). The notion of Martin-Löf randomness is based on effective stochastic laws, i.e. predicates (properties) which are satisfied almost surely, that is, with probability one. Randomness of a sequence x in the Martin-Löf sense is defined not by looking at the individual sequence x alone, but by using an entire collection of predicates of x, and the definition appears to be in the form of a “second-order” definition, involving universal quantification over predicates of sequences.^21 On the other hand, the Kolmogorov-Chaitin definition does not directly refer to any external object other than the sequence x itself and the complexity measure K. Instead of taking an “external top down” approach, it looks at x “from inside” in terms of initial segments (a purely internal view), measuring their information-complexity using K, and declares x to be random if none of the initial segments admits any substantially shorter description. The Kolmogorov-Chaitin definition therefore reduces the definition of randomness for infinite sequences to that for finite strings, establishing a fundamental connection between the two notions. This striking dissimilarity makes the following celebrated theorem of Schnorr truly remarkable.

THEOREM 36 Schnorr's Theorem. A sequence is Martin-Löf random if and only if it is Kolmogorov-Chaitin random.

For a proof, see any of [Nies, 2009; Downey and Hirschfeldt, 2010; Li and Vitanyi, 2008; Chaitin, 1992]. The equivalence of Martin-Löf randomness with Kolmogorov-Chaitin randomness forms the basis of the assertion that Martin-Löf's definition has truly captured the notion of randomness for infinite sequences, and therefore gives a satisfactory

^21 Of course, since the predicates in question are effectively enumerated, in actuality the universal quantifier is reduced to range over natural numbers, but we are referring to the form of the definition in classical terms.
solution to this classic problem in the philosophy of mathematics and statistics (Problem 1 of the Introduction). Moreover, as the Kolmogorov-Chaitin definition shows, the notions of randomness for finite and infinite sequences are fundamentally linked, and therefore the solutions to both Problems of the Introduction can be given simultaneously, in an interconnected fashion. (A characterization of Martin-Löf randomness in terms of plain complexity C has also been obtained, but it is a much more complicated condition than the one for K.) From now on, by a random infinite sequence we will mean a Martin-Löf random or, equivalently, Kolmogorov-Chaitin random sequence.

DEFINITION 37 Randomness for Infinite Sequences, Final Version. An infinite binary sequence x will be called random if it is Martin-Löf random or, equivalently, Kolmogorov-Chaitin random.

The assertion that Martin-Löf randomness, or equivalently Kolmogorov-Chaitin randomness, captures the “true notion of randomness” conforming to our intuition is sometimes called the Martin-Löf-Chaitin thesis. The Martin-Löf-Chaitin thesis, like the Church-Turing thesis for the definition of algorithm, is not a mathematical proposition that can be proved or refuted. We discuss it further in Section 12.
9.1 Properties of infinite random sequences
We list here some regularity properties of infinite random sequences as evidence that we have the correct definition of randomness for infinite sequences. Recall that when we say “random” without qualification, we mean Martin-Löf random, or equivalently Kolmogorov-Chaitin random. For proofs and further details of the following facts, see [Calude, 1994; Li and Vitanyi, 2008; Nies, 2009; Downey and Hirschfeldt, 2010].

THEOREM 38 Effective Place Selections Preserve Randomness. Let ⟨x_1, x_2, . . .⟩ be a random infinite binary sequence and ϕ : {0,1}* → {0,1} be a partial computable function. Suppose that ϕ(⟨x_1, . . . , x_{n−1}⟩) = 1 for infinitely many n, and let n_1 = the least n such that ϕ(⟨x_1, . . . , x_{n−1}⟩) = 1, n_2 = the next such n, etc. Then the subsequence ⟨x_{n_1}, x_{n_2}, . . . , x_{n_k}, . . .⟩ is also random.

COROLLARY 39. Every random sequence is Mises-Wald-Church random.
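To make the notion of an effective place selection in Theorem 38 concrete, here is a tiny Python sketch (ours), where ϕ selects position n exactly when the previously seen bit x_{n−1} equals 1:

```python
def select_subsequence(x):
    # phi(<x_1,...,x_{n-1}>) = 1 exactly when the last bit seen so far is a 1;
    # using 0-based indexing, position n is selected when x[n - 1] == 1
    return [x[n] for n in range(1, len(x)) if x[n - 1] == 1]

print(select_subsequence([1, 0, 1, 1, 0, 1, 0, 0]))   # the bits following a 1: [0, 1, 0, 0]
```

Theorem 38 says that applying any such computable rule to a random sequence again yields a random sequence.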
COROLLARY 40 Computable Restrictions Preserve Randomness. If x is random, and n_1 < n_2 < n_3 < . . . form a computable sequence of strictly increasing numbers, then the sequence ⟨x_{n_1}, x_{n_2}, x_{n_3}, . . .⟩ is also random.

The above result remains true if n_1, n_2, . . . form a computable sequence of distinct numbers (not necessarily increasing).

COROLLARY 41. If x is random, then neither the set {n : x_n = 1} nor its complement {n : x_n = 0} can contain any infinite computably enumerable set (they are immune). In particular, neither these sets nor the sequence x is computable.
A real number a is called computable if the set {(m, n) : m/n < |a|} is a computable subset of N × N. A real number a ∈ [0, 1] is called random if there is a random sequence x ∈ {0,1}^N such that x is the sequence of digits in a binary expansion of a, i.e. a = ∑_{n=1}^∞ x_n/2^n.
COROLLARY 42. If a ∈ [0, 1] is a random real number, then a is not computable. In particular, a is irrational, and in fact transcendental, since all algebraic real numbers are computable.
The following result shows that, like convergence, randomness is an eventual property.

THEOREM 43. The randomness of a sequence is a “tail” property. In particular: (a) If the sequence y is obtained from x by altering only finitely many values of x, then x is random iff y is random. (b) If x = ⟨x_1, x_2, . . . , x_n, . . .⟩ and y = ⟨x_{n+1}, x_{n+2}, . . . , x_{n+k}, . . .⟩ is obtained by removing the first n terms of x (an n-step shift), then x is random iff y is random.

THEOREM 44. If x is random, then: (a) x satisfies the law of the iterated logarithm. (b) x is absolutely Borel normal: if the real number x̂ having x as its binary expansion digits is expanded in base b > 1, then the resulting expansion is Borel normal in base b.
9.2 An example of a specific random sequence: Ω
It is clear that a set S of strings is prefix-free iff the basic intervals N(σ) (in the Cantor space) indexed by the strings σ from S form a disjoint family, and so the open set formed by their union has measure equal to the sum of the measures of the basic intervals, i.e., µ(∪_{σ∈S} N(σ)) = ∑_{σ∈S} µ(N(σ)) = ∑_{σ∈S} 1/2^{|σ|}. Therefore, for any prefix-free set S of strings, we have:

    ∑_{σ∈S} 1/2^{|σ|} ≤ 1,

an important fact known as the Kraft inequality.
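A quick sanity check of the Kraft inequality on a concrete prefix-free set, in a short Python sketch (our own illustration):

```python
from fractions import Fraction

def is_prefix_free(codes):
    # no code may be a proper prefix of another code
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

# the unary codes 1^(n+1) 0 for n = 0..9 form a prefix-free set
unary = ["1" * (n + 1) + "0" for n in range(10)]
assert is_prefix_free(unary)

kraft_sum = sum(Fraction(1, 2 ** len(c)) for c in unary)
print(kraft_sum, kraft_sum <= 1)   # 1023/2048 True
```

Extending the family to all n only pushes the sum up to 1/2, comfortably within the Kraft bound.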
Recall now the partial computable prefix-free function ξ that was used to define the optimal prefix-free complexity K. We define a real number Ω, called the Halting Probability or Chaitin's Omega, by:

    Ω = ∑_{σ∈dom(ξ)} 1/2^{|σ|}.
Since the domain of ξ is a prefix-free set, Ω ≤ 1 by the Kraft inequality. Fix a program Q which computes the partial computable function ξ. Then Q halts on an input binary string δ iff ξ(δ) is defined. Suppose now that a fair coin is flipped until some initial segment of the sequence of flips is found to be in the domain of ξ, or equivalently until it generates a string having an initial segment on which Q halts. Of course, in many cases no such string will be generated (i.e. we may have an infinite sequence of flips for which there is no initial segment string on which Q ever halts). In this sense, Ω denotes the probability that Q halts if its input is generated by a random sequence of coin flips. This is the reason Ω is known as the Halting Probability. It is a non-computable real number, and so is transcendental. Alternatively, define an open set by:

    G_Ω = ∪_{σ∈dom(ξ)} N(σ).
Then G_Ω is effective open and Ω equals the Lebesgue measure of G_Ω. The infinite sequence of bits forming the binary expansion of Ω is also denoted by Ω. Ω is then a random infinite sequence, our first example of a specific random infinite sequence. Ω has many remarkable properties; see [Bennett, 1979; Calude, 1994; Chaitin, 1992].

10 RELATIVE AND STRONGER RANDOMNESS. HIERARCHIES
Given two sequences x = ⟨x_1, x_2, . . .⟩ and y = ⟨y_1, y_2, . . .⟩, we merge them into a single sequence x ⊕ y by intertwining the terms as follows:

    x ⊕ y = ⟨x_1, y_1, x_2, y_2, . . . , x_n, y_n, . . .⟩.

We call x ⊕ y the join of x and y. Since each of x and y can be extracted from x ⊕ y, we can, from an information content point of view, think of x ⊕ y as perfectly combining the information contained in x and that in y, without any “information loss”. From the properties of randomness given earlier, it is immediate that if x ⊕ y is random then so are both x and y. However, if x and y are random, it does not follow that x ⊕ y is random. As a drastic example, if x = y, then x ⊕ y cannot be random, since every alternate pair of consecutive bits of x ⊕ y would be identical, which allows one to devise a simple successful gambling system against it (it also violates Borel normality, as neither the pattern 010 nor the pattern 101 occurs in it). So the question arises: under what conditions on x and y is x ⊕ y random? This was answered beautifully by van Lambalgen in terms of the notion of relative randomness, which we now discuss. Roughly speaking, x is random relative to
y (abbreviated x is random in y) if even a complete knowledge of y does not improve the predictability of the bits of x. This is exactly the opposite of the situation in our drastic example of x = y, where knowledge of y allows us to perfectly predict x; in other words, there the information about x can be obtained (in this case completely) from the information of y. We may therefore expect that the relation of one sequence being random relative to another is some form of “information-independence”, although it is not a priori clear that this relation should be symmetric (the drastic example shows that the relation must be irreflexive). We proceed to formalize this idea.

DEFINITION 45 Effective Open, Relative Version. Let z ∈ {0,1}^N. A set G is said to be effective open relative to z, or simply effective open in z, or in symbols G is Σ^0_1(z), if there is an effective open set H such that for all x ∈ {0,1}^N: x ∈ G ⟺ x ⊕ z ∈ H.

Notice how, in the formation of G, the information of z becomes available: as before, G is still the union of basic intervals which are enumerated by some computation, but now that computation is also allowed to use any additional information from z as needed. Thus, if G is effective open, then G is effective open in z, for any z (additional information from z is available, but not used, in the computation which enumerates the basic intervals forming G). On the other hand, if G is effective open in z and z is computable, then G is already effective open (since z can be computed by some program P, the computation which enumerates the basic intervals forming G does not get any additional help from knowing z, since any information about z could also be obtained by calling P as a subprogram). Similarly, we define a sequence of sets being “uniformly effective open in z”.

DEFINITION 46 Uniformly Effective Open (Relative). A sequence G_1, G_2, . . . of sets is uniformly effective open in z, or uniformly Σ^0_1(z), if there are sets H_1, H_2, . . . , uniformly effective open, such that for all n, x ∈ G_n ⟺ x ⊕ z ∈ H_n.

Again, we define a set being “effective measure-zero in z” just by changing the old definition with “effective open” replaced by “effective open in z”. (This process, known as relativization, can actually be carried out fruitfully throughout most of computability theory.)

DEFINITION 47 Effective Measure-Zero, Relative Version. A set E is effective measure-zero in z if there are sets G_1, G_2, . . . , uniformly effective open in z, such that µ(G_n) < 1/n and E ⊆ ∩_n G_n.

And, finally:
DEFINITION 48 Relative Randomness. x is random relative to y (or x is random in y) iff x does not belong to any set effective measure-zero in y.
Now we can state van Lambalgen's theorem:

THEOREM 49 Van Lambalgen. For x, y ∈ {0,1}^N, the following conditions are all equivalent to each other: (a) x ⊕ y is random; (b) y is random in x and x is random in y; (c) y is random and x is random in y.

Van Lambalgen's theorem is remarkable because the apparently weaker third condition in the theorem implies the second. In particular, if x and y are random, then

    x is random in y ⟹ y is random in x,

which is surprising because, as we mentioned earlier, this symmetry is not at all clear a priori.

The existence theorems all remain valid under relative randomness. E.g., for any y, the set of sequences random in y forms a full-measure set whose complement is effective measure-zero in y. If y is computable, this does not give a new collection, as then the set of sequences random in y equals the set of random sequences. But if y is not computable, we may get stronger versions of randomness. E.g., if y is the characteristic function of the uncomputable set ∅′, or if y is Ω (more precisely, if the sequence y consists of the digits in the binary expansion of Ω), then the collection of sequences random in y cannot contain Ω anymore, and therefore forms a strictly smaller subclass of the random sequences, known as the 2-random sequences. Using a prefix-free function ξ_Ω partial computable in Ω which is optimal among all prefix-free functions partial computable in Ω, one can now define the halting probability relative to Ω as:

    Ω_2 = ∑_{σ∈dom(ξ_Ω)} 1/2^{|σ|}.
Then Ω_2 is 2-random, while Ω is random but not 2-random. This process can be iterated to obtain stronger and stronger versions of randomness, which we now describe in greater generality.
10.1 The arithmetical hierarchy and n-randomness
We will use a notation where relations are identified with predicates: if R is a 3-place relation, then we abbreviate “(a, b, c) ∈ R” by simply writing R(a, b, c), etc. A relation A ⊆ N^k × {0,1}^N is called effective open if there is a computably enumerable E ⊆ N^k × {0,1}* such that for all m_1, . . . , m_k ∈ N and x ∈ {0,1}^N,

    A(m_1, . . . , m_k, x) ⟺ E(m_1, . . . , m_k, σ) for some initial segment σ of x.
A relation is called effective closed if its complement is effective open, and it is called computable (or effective clopen) if it is both effective open and effective closed. Starting with the computable relations as a basis, we can define relations of higher complexity by adding a series of n “alternating quantifiers” ranging over natural numbers, as follows. We define a relation A to be Σ^0_n (n ≥ 1) if there is a computable relation R such that for all m_1, . . . , m_k ∈ N and x ∈ {0,1}^N,

    A(m_1, . . . , m_k, x) ⟺ (∃p_1)(∀p_2) . . . (Q p_n) R(p_1, . . . , p_n, m_1, . . . , m_k, x),

where Q stands for “∃” if n is odd and for “∀” if n is even. We also define a relation to be Π^0_n if its complement is Σ^0_n, and a relation to be ∆^0_n if it is both Σ^0_n and Π^0_n. It then turns out that the class Σ^0_1 coincides with the class of effective open relations, and ∆^0_1 is the same as the class of computable relations. Moreover, this indeed gives a strict hierarchy of classes of relations defined by their definitional complexity (the number of alternating quantifiers), with the class ∆^0_n strictly contained in each of Σ^0_n and Π^0_n, both of which are strictly contained in ∆^0_{n+1}, as shown below.
    ∆^0_1 ⊊ Σ^0_1, Π^0_1 ⊊ ∆^0_2 ⊊ Σ^0_2, Π^0_2 ⊊ ∆^0_3 ⊊ · · · ⊊ ∆^0_n ⊊ Σ^0_n, Π^0_n ⊊ ∆^0_{n+1} ⊊ · · ·

(Here each ∆^0_n is strictly contained in each of the mutually incomparable classes Σ^0_n and Π^0_n, and both of these are strictly contained in ∆^0_{n+1}.)
This hierarchy is called the Arithmetical Hierarchy. (It is a refinement of the finite levels of the classical hierarchy of Borel sets in analysis, in which only effective countable unions and intersections are allowed. See [Rogers Jr, 1987; Odifreddi, 1992; Moschovakis, 1980] for more details.) Finally, we can define n-randomness.

DEFINITION 50. A sequence z is called n-random iff there is no Σ^0_n effective measure-zero set containing z, or more precisely, if there is no sequence H_1, H_2, . . . of sets such that the relation H defined by H(n, x) ⟺ x ∈ H_n is Σ^0_n, µ(H_n) < 1/n for all n, and z ∈ H_n for all n. We say z is arithmetically random iff z is n-random for all n = 1, 2, . . . .

One can show that it makes no difference to the definition whether we require the sets H_n to all be open or not. Thus, 1-random is the same as being random (i.e., Martin-Löf random), and we have a sequence of stronger and stronger notions of randomness, corresponding to the levels of the arithmetical hierarchy. This hierarchy is indeed strict, meaning that (n+1)-randomness implies n-randomness, but for each n there is an n-random sequence which is not (n+1)-random. E.g., Ω is 1-random but not 2-random, and
Ω_2 is 2-random but not 3-random, and so on. Moreover,

    x is 2-random ⟺ x is random relative to Ω ⟺ x is random and Ω is random relative to x.

The first of these equivalences follows from standard facts about the arithmetical hierarchy, and the second from van Lambalgen's theorem. Recall that an infinite binary sequence x = ⟨x_1, x_2, . . .⟩ was defined to be Kolmogorov-Chaitin random if for some b > 0, K(⟨x_1, x_2, . . . , x_{n−1}⟩) > n − b for all n. If the prefix complexity K is replaced by the plain complexity C, then one obtains an empty definition, as was proved by Martin-Löf: there is no sequence x = ⟨x_1, x_2, . . .⟩ such that for some b > 0, C(⟨x_1, x_2, . . . , x_{n−1}⟩) > n − b for all n. However, it was also shown that the sequences x which satisfy the condition

    for some b > 0: C(⟨x_1, x_2, . . . , x_{n−1}⟩) > n − b for infinitely many n

form a full-measure set which is contained in the set of random sequences. Such sequences are sometimes called Kolmogorov random. Remarkably, it was established recently that the Kolmogorov random sequences are precisely the 2-random ones.
10.2 Other stronger notions of randomness
Many other notions of randomness stronger than 1-randomness have been studied. E.g., by relaxing the condition µ(H_n) < 1/n in the definition of n-randomness to lim_n µ(H_n) = 0, one obtains the notion of weak-(n+1)-randomness, which lies strictly between n-randomness and (n+1)-randomness. One can extend the arithmetical hierarchy into the transfinite using computable ordinals, which results in the hyperarithmetical hierarchy. Martin-Löf first suggested the notion of hyperarithmetical randomness. Beyond the hyperarithmetical classes, there is an even more comprehensive hierarchy known as the analytical hierarchy. At the first level of this hierarchy are the Π^1_1 relations. A relation A is called Π^1_1 if there is an arithmetical relation B (Σ^0_2 is enough) such that

    A(n_1, . . . , n_k, x) ⟺ ∀y B(n_1, . . . , n_k, x ⊕ y).

(The class of Π^1_1 relations includes all hyperarithmetical sets and more, and hence certainly all arithmetical sets as well.) The notion of randomness in the sense of Martin-Löf has recently been extended fruitfully to the class of Π^1_1 sets (where it has been named Π^1_1-ML-randomness) by Hjorth and Nies [Hjorth and Nies, 2007], who have established the analog of Schnorr's theorem and other results. The strongest notion of randomness that appears to have been studied so far is called Π^1_1-randomness by Hjorth and Nies [Hjorth and Nies, 2007]. The union of all Π^1_1 measure-zero sets is itself a Π^1_1 measure-zero set (the largest Π^1_1 measure-zero set), so Π^1_1-randomness is defined as membership in the complement of the largest Π^1_1 measure-zero set.
10.3 Reducibility and degrees of computability
Given A ⊆ N and z ∈ {0,1}^N, we say that A is computably enumerable in z, or in symbols A ∈ Σ^0_1(z), if there is a c.e. set B ⊆ N × {0,1}* such that

    ∀n (A(n) ⟺ ∃k B(n, ⟨z_1, z_2, . . . , z_k⟩)).

We say that A is computable in z, or A is Turing-reducible to z, in symbols A ≤_T z, if both A and its complement are computably enumerable in z. Finally, a sequence x ∈ {0,1}^N is computable in z, or Turing-reducible to z, if the set {n : x_n = 1} is computable in z. Roughly speaking, x ≤_T y means that x can be computed by a program which has access to the bits of y in order, or, more loosely, that y is computationally at least as complex as x. The notion of Turing-reducibility is reflexive and transitive, and the corresponding equivalence relation, called Turing equivalence,

    x ≡_T y ⟺ x ≤_T y ∧ y ≤_T x,

generates equivalence classes known as Turing degrees. The study of Turing-reducibility and degrees has been one of the most important areas of classical recursion theory. There are several other types of computational reducibilities, generally stronger than Turing-reducibility, that are important for the theory of computability. The interaction between randomness and Turing-reducibility (and other computational reducibilities not introduced here) has also been studied, and has generated fruitful applications in both directions. Another notion straddling both randomness and computability theory that has been studied extensively is that of K-triviality. A sequence x = ⟨x_1, x_2, . . .⟩ is K-trivial if, for some constant b, K(⟨x_1, . . . , x_n⟩) ≤ K(n) + b for all n. This property is quite the opposite of randomness. It is known that x is K-trivial iff every random sequence is random relative to x. Many other notions that interact with both computability theory and randomness form a part of current research, which is progressing vigorously. We refer the reader to [Nies, 2009; Downey and Hirschfeldt, 2010], where further extensive references can be found.

11 RANDOMNESS VIA MARTINGALES. OTHER FREQUENTIST DEFINITIONS
In the previous section, we considered randomness notions stronger than Martin-Löf randomness. Now we will focus on weaker notions of randomness, which are perhaps more important from a philosophical viewpoint.
11.1 Schnorr randomness
A critique of Martin-Löf randomness by Schnorr was that it yields too strong a notion of randomness, because its underlying notion of effective measure-zero is not effective enough. In order for a sequence of uniformly effective open sets G_1, G_2, . . . to define an effective measure-zero set via intersection, it is not enough, according
to Schnorr, that their measures effectively approach zero merely via an effective bound (e.g. µ(G_n) < 1/n); we also need the measures µ(G_n) of the sets themselves to be computable real numbers (uniformly in the index n). Using this stronger criterion for being effective measure-zero, we get a weaker notion of randomness, called Schnorr randomness. Schnorr randomness has been studied extensively, but it fails to have certain regularity properties of Martin-Löf randomness. Two examples are: (a) Unlike Martin-Löf randomness, Schnorr randomness does not possess a uniform test, i.e. the class of Schnorr random sequences cannot be defined as the complement of the intersection of a single sequence of uniformly effective open sets G_1, G_2, . . . such that µ(G_n) < 1/n and such that µ(G_n) is a computable real number uniformly in n. (b) The van Lambalgen theorem fails for Schnorr randomness. In fact, there is a Schnorr random sequence z = x ⊕ y such that the two halves x and y are Turing equivalent. This does not conform well with the intuitive notion of randomness.
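In symbols, using the conventions of this article (our restatement of the definition just described): x is Schnorr random iff

    x ∉ ∩_n G_n

for every sequence G_1, G_2, . . . of uniformly effective open sets with µ(G_n) < 1/n for which each µ(G_n) is a computable real number, uniformly in n.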
11.2 Randomness defined by martingales
Recall that in von Mises type definitions of randomness, one uses the concept of a betting strategy, or more precisely a place selection rule, to select places on which to bet, and after all selections are done, one checks whether the limiting frequency has become biased (subsections 5.1 and 5.2). The definition does not a priori have anything to do with the amount bet. Such properties are called stochasticity (as opposed to randomness) in the modern mathematical literature, but note that this is essentially a matter of terminology. We now introduce a concept of betting strategy involving the amount bet, or equivalently the capital of the gambler (“total money in pocket”), at each stage of betting. We think of the infinite sequence x = ⟨x_1, x_2, x_3, . . .⟩ being revealed to the gambler, bit by bit, in order. Before each bit is revealed, the gambler may bet an amount a predicting the value of the bit (one of “next bit revealed will be 0” or “next bit revealed will be 1”). We will assume the fairness condition that if the prediction turns out to be correct, the gambler gains the amount a (capital increases by a); otherwise the gambler loses the same amount a (capital decreases by a). To formalize this type of strategy, we think of a finite binary string σ of length |σ| = n − 1 as representing the n-th stage of the game, so that σ consists of the bits of the infinite sequence that have been revealed so far (before the n-th bit is revealed), and let F(σ) denote the gambler's capital at this stage. Suppose that at stage σ, with capital F(σ), the gambler bets an amount a predicting the next bit to be 0. If the gambler turns out to be correct, then σ is
extended to σ0 and the capital goes up by a to F(σ0) = F(σ) + a, but if incorrect, then σ is extended to σ1 and the capital goes down by a to F(σ1) = F(σ) − a. Similarly, if the gambler had predicted the next bit to be 1 (with the same bet amount a), then we would have σ extending to σ1 and F(σ1) = F(σ) + a if the gambler turns out to be correct, and σ extending to σ0 and F(σ0) = F(σ) − a if the gambler is incorrect. A final case is when the gambler chooses not to bet at this stage, which is expressed by having a = 0 and F(σ0) = F(σ1) = F(σ) (no change in capital). All cases can be summarized by a single zero-sum condition, in which the first bracket is the capital change if the next bit is 0 and the second bracket the capital change if the next bit is 1:

    [F(σ0) − F(σ)] + [F(σ1) − F(σ)] = 0.

Therefore we make the following definition.
DEFINITION 51 Martingales. A martingale or capital betting strategy is a function F : {0,1}* → R satisfying two conditions: (a) F(σ) ≥ 0 for all σ (finiteness condition); and (b) [F(σ0) − F(σ)] + [F(σ1) − F(σ)] = 0 for all σ (zero-sum or fairness condition). Given x ∈ {0,1}^N and a martingale F : {0,1}* → R, we say that the martingale F succeeds on x if the capital becomes unbounded on the outcome sequence x, i.e., if

    sup_n F(⟨x_1, x_2, . . . , x_{n−1}⟩) = +∞.
An example of a martingale is where the gambler always predicts an outcome of 0, with the bet being a fixed fraction r (0 < r < 1) of the available capital. This martingale is the function recursively defined by F(σ0) = (1 + r)F(σ) and F(σ1) = (1 − r)F(σ); or, if the initial capital is 1, more explicitly as F(σ) = (1 + r)^m (1 − r)^n if σ is a string of length m + n with m zeros and n ones. Another example is where the gambler always predicts an outcome of 0, betting the entire amount of available capital (“bold play”). If the initial capital is 1, this martingale is given as follows: if 1 does not occur in σ, then F(σ) = 2^{|σ|}, else F(σ) = 0. It can be shown that the concept of a martingale is really a generalization of the concept of a place selection rule. To each place selection rule ϕ one can assign an especially simple type of martingale F_ϕ (one which always bets a constant fraction of the existing capital), such that the ϕ-selected part of x has limiting frequency 1/2 iff F_ϕ does not succeed on x. A converse association is also possible. Definitions of randomness using place selection rules (as done by von Mises) and definitions using martingales are both examples of characterizations of randomness via the impossibility of gambling systems; the place selection method is known as the frequentist approach, while the martingale method may be called non-frequentist.
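The first example above, the constant-fraction strategy, is easy to simulate; here is a minimal Python sketch (our own illustration):

```python
def constant_fraction_martingale(sigma, r=0.5, initial=1.0):
    # the gambler always predicts 0, betting the fraction r of the current capital,
    # so F(sigma 0) = (1+r) F(sigma) and F(sigma 1) = (1-r) F(sigma)
    capital = initial
    for bit in sigma:
        capital *= (1 + r) if bit == "0" else (1 - r)
    return capital

print(constant_fraction_martingale("0" * 20))   # grows like 1.5**20: the martingale succeeds
print(constant_fraction_martingale("01" * 10))  # shrinks like 0.75**10: it fails on balanced strings
```

The fairness condition holds because the two possible capital changes, +r·F(σ) and −r·F(σ), always cancel.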
11.3 A martingale characterization of randomness
A martingale F is said to be computably enumerable if the relation R defined by

    R(m, n, σ) ⟺ m/(n + 1) < F(σ)

is computably enumerable as a subset of N × N × {0,1}*.
THEOREM 52 Martingale Characterization of Martin-Löf Randomness. A sequence x is random (i.e. Martin-Löf random) iff no computably enumerable martingale succeeds on x.

Thus we now have three different but equivalent definitions of randomness. This last characterization, in terms of martingales, gives a definition of randomness using the unpredictability approach. If, instead of computably enumerable martingales, we require stronger effectiveness conditions on the martingales, we obtain weaker notions of randomness, as we will see now. In the following, we consider only rational-valued martingales.

DEFINITION 53. A partial computable martingale is a partial function F : {0,1}* → Q satisfying, for all σ: (a) F(σ) ≥ 0; (b) if F(σ) is defined, so is F(τ) for any prefix τ of σ; (c) F(σ0) is defined iff F(σ1) is defined, and if so, then [F(σ0) − F(σ)] + [F(σ1) − F(σ)] = 0; (d) the relation R defined by R(σ, m, n) ⟺ F(σ) = m/(n + 1) is a computably enumerable subset of {0,1}* × N × N.
A computable martingale is a partial computable martingale which is total (i.e., whose domain is {0, 1}∗ ). (It can be seen that the graph of a computable martingale is computable, not just computably enumerable.) Here are the main notions of randomness arising out of these types of martingales. DEFINITION 54. A sequence is partial computably random if no partial computable martingale succeeds on it. A sequence is computably random if no computable martingale succeeds on it.
Of course, every partial computably random sequence is computably random. It can be shown that

    (Martin-Löf) Random ⟹ Partial Computably Random ⟹ Computably Random ⟹ Schnorr Random,

but none of these implications can be reversed [Nies, 2009].
11.4 Non-monotonic betting strategies
So far, we have seen two distinct types of betting strategies giving rise to notions of randomness:

• Martingales, or capital betting strategies, which lead to definitions of randomness based on the failure of the martingale, such as the partial computable randomness and computable randomness that we just saw.

• Place selection rules, which lead to definitions of randomness based on the limiting frequency of the selected part, such as the Mises-Wald-Church stochasticity (or randomness) and Church stochasticity (or randomness) that we saw in subsection 5.2.

These four notions are all weaker than (Martin-Löf) randomness. There is a further generalization possible in the type of betting allowed, called non-monotonic betting, that actually tightens the notions further and makes them more robust. To understand non-monotonic betting, both for martingales and for place selection rules, suppose that the bits of the sequence x = ⟨x_1, x_2, . . . , x_n, . . .⟩ lie covered on an infinitely long table (instead of being revealed serially one by one). The gambler now uncovers the bits in a not-necessarily increasing order, and along the way decides which places to select or to bet on. For example, the gambler may choose to first uncover the ninth place to find the value of x_9, then uncover the fourth place to find x_4, and then, based on these two observations, decide to select or bet on the seventeenth position (before uncovering it). After x_17 is uncovered, the gambler may choose to next uncover either x_3 or x_7, depending on whether x_17 turns out to be 0 or 1, and so on. We omit the formal details and hope that the above informal description (made concrete in the sketch following the definitions below) makes it intuitively clear what non-monotonic betting is. In particular, it can be applied both to martingales and to place selection rules, as follows.

DEFINITION 55. A sequence x is called Kolmogorov-Loveland random if no computable non-monotonic martingale succeeds on it. A sequence x is called Kolmogorov-Loveland stochastic if for every computable non-monotonic place selection rule, the selected part has limiting frequency 1/2.
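The following Python fragment (our own hypothetical illustration; the particular rule is made up) mirrors the informal scenario above, uncovering positions 9, 4, and 17 in that order and letting the observed bits steer the play:

```python
def nonmonotonic_play(x):
    # x is a 1-indexed list of covered bits; positions are visited out of order
    seen = {9: x[9], 4: x[4]}            # first uncover x_9, then x_4
    prediction = seen[9] ^ seen[4]       # a made-up rule computed from the observed bits
    outcome = x[17]                      # bet on position 17, then uncover it
    next_position = 3 if outcome == 0 else 7   # where to look next depends on x_17
    return prediction, outcome, next_position

bits = [None] + [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
print(nonmonotonic_play(bits))
```

Monotonic strategies are the special case in which the positions are uncovered in increasing order.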
A nice feature of non-monotonic betting strategies is that it does not matter whether we use “computable” or “partial computable” in the definitions above: in each case the result is an equivalent notion. In other words, changing the above definitions to “partial computable” will not give us a stronger notion of randomness. Thus the non-monotonic forms have a kind of robustness that was not there for monotonic betting strategies, since partial computable randomness is a strictly stronger notion of randomness than computable randomness, and Mises-Wald-Church stochasticity is strictly stronger than Church stochasticity. Since martingales are more general than place selection rules, a martingale version gives a stronger notion of randomness than the corresponding stochastic (i.e. limiting frequency after place selection) version. Thus Kolmogorov-Loveland randomness is strictly stronger than Kolmogorov-Loveland stochasticity, partial computable randomness is strictly stronger than Mises-Wald-Church stochasticity, and computable randomness is strictly stronger than Church stochasticity. We summarize these notions in the following table. Each notion in this table implies the one directly below it and also the one to the right of it.

    Strategy type    | Non-monotonic (partial or total)      | Monotonic, partial                   | Monotonic, total
    Martingale       | Kolmogorov-Loveland Random (KLR)      | Partial Computably Random (PCR)      | Computably Random (CR)
    Place selection  | Kolmogorov-Loveland Stochastic (KLS)  | Mises-Wald-Church Stochastic (MWCS)  | Church Stochastic (CS)

Six Randomness Notions for Various Types of Betting Strategies

In fact, with “Random” standing for Martin-Löf randomness, “Schnorr” for Schnorr randomness, and the rest of the abbreviations as in the table, we have the following implications.
              ∗
    Random ----> KLR ----> PCR ----> CR ----> Schnorr
                  |         |        |
                  v         v        v
                 KLS ----> MWCS ---> CS

Implication Diagram for Weak Randomness Notions
Each arrow represents an implication, and almost all the implications above are strict (they cannot be reversed). However, for the implication marked with “∗”, it is not known whether the implication can be reversed. In other words we have:
Question. While every Martin-Löf random sequence is Kolmogorov-Loveland random, is the converse true?

This is perhaps the biggest open problem in current research on randomness. Many researchers feel that the answer is no, although the work of Merkle et al. [Merkle et al., 2006] has shown that the two notions are rather close. Nies and Miller have published a list of open problems [Miller and Nies, 2006] in the area, some of which have been solved since then.
11.5 Can we resurrect von Mises?
While Kolmogorov-Loveland randomness is the only major notion of randomness that remains close to Martin-Löf randomness, it does not have the true spirit of von Mises' idea of randomness: von Mises' definition, which is based on place selections, is a truly frequentist one, that is, it is defined in terms of limiting frequency, while the martingale notions are all defined in terms of capital growth. Therefore, even if it turns out that Kolmogorov-Loveland randomness is equivalent to Martin-Löf randomness, one would still be looking for a characterization of Martin-Löf randomness in terms of a frequentist condition. In recent literature, the term stochasticity is used for randomness defined in terms of a frequentist condition, or more precisely using the limiting frequency of place selections. Among these, the notion closest to Martin-Löf randomness is Kolmogorov-Loveland stochasticity, as the above diagram of implications indicates. Unfortunately, like Mises-Wald-Church stochasticity, Kolmogorov-Loveland stochasticity admits random sequences which do not satisfy the Law of Symmetric Oscillations, and is therefore rather far from Martin-Löf randomness. Li and Vitányi write in their 2008 book [Li and Vitanyi, 2008, p. 158]: “[T]he problem of giving a satisfactory definition of infinite Martin-Löf random sequences in the form proposed by von Mises has not yet been solved.” Thus, the search for a true frequentist characterization of Martin-Löf randomness continues.
11.6 The Ergodic Theorem as a Frequentist Definition
The general version of the Birkhoff Ergodic Theorem is not fully effective (see [Avigad, 2009; Hoyrup, 2008]). However, we consider the version of the theorem which is known as the “law of frequencies”. We restrict to Lebesgue measure on the Cantor space. Say that a measurable map U : {0,1}^N → {0,1}^N is measure preserving if µ(U^{−1}[A]) = µ(A) for all measurable A, and a map T : {0,1}^N → {0,1}^N is ergodic if T is measure preserving and, for all measurable A, if T^{−1}[A] ∆ A is measure-zero then µ(A) = 0 or µ(A) = 1.
We now state the Birkhoff Ergodic Theorem in a slightly variant form.

THEOREM 56 The Birkhoff Ergodic Theorem as the Law of Frequencies. If T : {0,1}^N → {0,1}^N is ergodic, E is measurable, and U : {0,1}^N → {0,1}^N is measure preserving, then (putting ⟦P⟧ = 1 if the statement P is true, and ⟦P⟧ = 0 otherwise), for almost all x:

    lim_{n→∞} (1/n) ∑_{k=1}^{n} ⟦T^k(U(x)) ∈ E⟧ = µ(E),
i.e., the frequency with which the T-orbit of U(x) enters E approaches µ(E). We also view this theorem, the Law of Frequencies, as a form of equidistribution (as in Weyl equidistribution): the T-orbit of U(x) is equidistributed. Now if ϕ is a place selection rule with the property that for almost all x, ϕ(⟨x_1, x_2, . . . , x_{n−1}⟩) = 1 for infinitely many n, then the map U_ϕ : {0,1}^N → {0,1}^N defined by U_ϕ(x) = the ϕ-selected part of x is defined for almost all x and is a measure preserving (and continuous) map. Finally, take T to be the left shift map T(⟨x_1, x_2, . . .⟩) = ⟨x_2, x_3, . . .⟩, and E = N(1) = {x ∈ {0,1}^N : x_1 = 1}, the basic interval consisting of sequences whose first bit is 1. With this setting, the condition in the von Mises definition coincides precisely with the condition in the Birkhoff Ergodic Theorem above, at least in the case when the domain of U_ϕ has full measure. Furthermore, if E is allowed to range over the basic intervals, then the condition of Borel normality is obtained, which we view as equidistribution of the T-orbit of U(x). Perhaps the main weakness of the von Mises definition is the lack of a requirement of general equidistribution, which is illustrated in the approach of [Knuth, 1998]. The notion of equidistribution is more general than the notion of limiting frequency, while still being in the frequentist spirit. Suppose we try to strengthen the Mises-Wald-Church definition by requiring Borel normality of the ϕ-selected part, not just the existence of an unbiased limiting frequency, i.e., we demand that for x to be random, the ϕ-selected part of x has to be Borel normal in base 2 for every partial computable place selection rule ϕ. This does not really change anything, as all Mises-Wald-Church stochastic sequences are Borel normal. However, instead of being limited to limiting frequency, it casts the definition in terms of equidistribution of the T-orbit of U(x), where T is the left-shift map. In other words, in Mises-Wald-Church stochasticity, the condition that the place-selected part y of x must satisfy is equivalent to the equidistribution of the following sequence of sequences obtained from y: ⟨y_1, y_2, y_3, . . .⟩, ⟨y_2, y_3, y_4, . . .⟩, ⟨y_3, y_4, y_5, . . .⟩, . . . We think that a frequentist definition of randomness should allow more general forms of equidistribution, so long as the method of forming the sequence is uniformly effective and ergodic (see [Knuth, 1998] for examples).
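The simplest instance of the theorem, with T the left shift, U the identity, and E = N(1), can be checked empirically; here is a small Monte Carlo sketch in Python (ours):

```python
import random

def orbit_frequency(x, n):
    # fraction of k in 1..n with T^k(x) in E = N(1); since T is the left shift,
    # T^k(x) lies in N(1) exactly when bit k+1 of x is 1 (x[k] with 0-based indexing)
    return sum(x[k] for k in range(1, n + 1)) / n

x = [random.randint(0, 1) for _ in range(10**5 + 1)]   # a sampled "typical" sequence
print(orbit_frequency(x, 10**5))                        # close to mu(N(1)) = 0.5
```

For an almost-surely sampled x, the printed frequency approaches µ(E) = 1/2, as the Law of Frequencies asserts.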
For example, suppose that the natural numbers are partitioned into infinitely many infinite, uniformly computable subsets, say the sets {1, 3, 5, 7, . . . }, {2, 6, 10, 14, . . . }, {4, 12, 20, 28, . . . }, etc. Then for a random x, if y is obtained from x by effective place selection, we expect that the sequences obtained by restricting y to each of these subsets, ⟨y_1, y_3, y_5, . . .⟩, ⟨y_2, y_6, y_10, . . .⟩, ⟨y_4, y_12, y_20, . . .⟩, . . . , should be equidistributed (which follows from the Ergodic Theorem). It is not a priori clear that Mises-Wald-Church stochasticity guarantees this. (Kolmogorov-Loveland stochasticity allows more general forms of U, but the T operator is still the same left shift.) Perhaps the following definition can be taken to be an “ergodic generalization” of von Mises' definition of randomness, where we think of U as “place selection”:

DEFINITION 57. x ∈ {0,1}^N is random iff for all sufficiently effective E ⊆ {0,1}^N, sufficiently effective ergodic T, and sufficiently effective measure preserving U, we have

    lim_{n→∞} (1/n) ∑_{k=1}^{n} ⟦T^k(U(x)) ∈ E⟧ = µ(E),
i.e., the frequency with which the T-orbit of U(x) enters E approaches µ(E). The three instances of the phrase “sufficiently effective” are deliberately left open to interpretation. We ask:

Question. Are there interpretations of “sufficiently effective” in the above definition which characterize Martin-Löf randomness?

For example, it is known that with “sufficiently effective” interpreted as ∆^0_1 (computable) in all three instances, every Martin-Löf random sequence is random in the sense of the above definition.^22 However, we are not aware of any result that positively answers the above question.

12 CONCLUSION. THE MARTIN-LÖF-CHAITIN THESIS
In this article, we introduced the reader to definitions of the notion of a random sequence using the three main ideas described in Section 1.6 that have dominated algorithmic randomness (cf. [Downey and Hirschfeldt, 2010]):

^22 See [González, 2008; Hoyrup, 2008] for a proof and a further list of problems. See also [V'yugin, 1997; V'yugin, 1999]. In [González, 2008; Hoyrup, 2008], this is studied in a more general setting (such as effective probability metric spaces), and results have been obtained for Schnorr randomness, but it is stated as an open problem for Martin-Löf randomness. It appears that the problem of characterizing Martin-Löf randomness in this ergodic way is open even in the specific case of the Cantor space with Lebesgue measure.
• Randomness as typicality. According to this stochastic or measure-theoretic idea, randomness of a sequence means its membership in all effective full-measure sets, or equivalently that the sequence “passes” all effective stochastic tests. The first major stochastic laws, the Borel strong law and Borel normality, go back to 1909. Martin-Löf randomness (1966), the first definition that is now almost universally accepted as the correct one, is defined using this approach.

• Randomness as incompressibility. This is an information-complexity approach that views the sequence “from inside”. According to this idea, randomness means the lack of short complete descriptions, or equivalently a high degree of algorithmic complexity, for all initial parts of the sequence.

• Randomness as unpredictability. According to this approach, randomness means the impossibility of devising a successful betting strategy against the sequence in question. In particular, this means that knowledge of some part of the sequence does not help to predict any other unknown bit. Two types of definitions of randomness arise from this deeply intuitive approach: the special frequentist type definitions first put forward by von Mises as “invariance of limiting frequency under admissible place selections”, and the more general non-frequentist type found in the definitions of randomness using martingales.

The ideal definition of randomness would be one which naturally and simultaneously satisfies the criteria given by these three approaches.
12.1 The Martin-Löf-Chaitin thesis
Following Delahaye [Delahaye, 1993], we use the term Martin-Löf-Chaitin Thesis for the assertion that Martin-Löf randomness, and equivalently Kolmogorov-Chaitin randomness, is the correct formulation of the intuitive notion of randomness for sequences. In this sense, it parallels the classic Church-Turing thesis, and is not a mathematical proposition to be proved or disproved. The Church-Turing thesis turned out to be highly successful in capturing the intuitive notion of algorithm. Delahaye has carried out a detailed comparison between the Church-Turing thesis and the Martin-Löf-Chaitin thesis, and concludes that in both cases the resulting precise definitions provide "profound insights to the mathematical and philosophical understanding of our universe." Delahaye admits that the Church-Turing thesis is "more deeply attested" and that the definition of randomness of sequences is "more complicated" compared to the definition of algorithm, but hopes that with time the Martin-Löf-Chaitin thesis will reach a level of certainty similar to that of the Church-Turing thesis. We think that, overall, Delahaye's assertions still remain valid.

In the past few decades, there has been a vast amount of research activity in the area of algorithmic randomness. Many definitions of randomness for sequences
have been studied extensively, but none was found to be clearly superior to the Martin-Löf definition. Compared with other notions, it appears to be of optimal strength: weaker notions turn out to be too weak, and the stronger ones too strong. In this way, the Martin-Löf-Chaitin thesis has gained strength in a slow but steady fashion. The proliferation of definitions of randomness for sequences makes the field harder for non-experts, but it should not be regarded negatively. It is an indication of the richness of the area, and the associated healthy and lively activity provides refinements and insights deep into the subject. Recall that while we consider the Church-Turing thesis more satisfying, there was an even larger number of associated notions of computability, both stronger and weaker, that were (and still are) studied fruitfully.

Perhaps the strongest evidence for the Martin-Löf-Chaitin thesis available so far is Schnorr's theorem, which establishes the equivalence between a naturally formulated "typicality definition" (Martin-Löf randomness) and a naturally formulated "incompressibility definition" (Kolmogorov-Chaitin randomness). Another justification of the Martin-Löf-Chaitin thesis is provided by the simplicity of the definition of Martin-Löf randomness within the arithmetical hierarchy. As seen in Schnorr's theorem,

    x is random ⟺ ∃b ∀n K(⟨x_1, . . . , x_n⟩) ≥ n − b.

This shows that Martin-Löf randomness has a Σ^0_2 definition (which also follows from the existence of a universal test). Most other definitions of randomness are more complicated, and situated at higher levels of the arithmetical hierarchy. In fact, the definitional complexity of Martin-Löf randomness is at the lowest possible level of the arithmetical hierarchy, assuming that any definition of randomness must satisfy the two basic axioms: (a) no random sequence should be computable, and (b) the set of random sequences has full measure. It then follows that no definition of randomness can be Π^0_2 or simpler, as it is a standard "basis theorem" that any Π^0_2 set of full measure contains computable sequences.

We also doubt that the resolution of the question of whether there are Kolmogorov-Loveland random sequences which are not Martin-Löf random will have much impact on the Martin-Löf-Chaitin thesis. However, a purely frequentist natural characterization of Martin-Löf randomness could substantially increase the strength of the Martin-Löf-Chaitin thesis. While the characterization in terms of computably enumerable martingales is a nice "unpredictability definition", it is neither as intuitive nor as frequentist as the von Mises definition. This is perhaps the most unsatisfying gap in the current state of affairs.
To summarize, we believe that while the Martin-Löf-Chaitin thesis is not (yet) as strong as the Church-Turing thesis, the two problems of the introduction, namely defining randomness for sequences and for strings in a way that captures our mathematical intuition of these objects, have essentially been solved quite satisfactorily, as described in this article. It is perhaps not too surprising that the definition of randomness, which in all cases presupposes the definition of algorithm, has turned out to be more complicated than the definition of algorithm itself.

ACKNOWLEDGMENTS

The author wishes to thank Ananda Sen for help with some of the references. The author is also indebted to Prasanta S. Bandyopadhyay and the anonymous referee for several useful suggestions.

BIBLIOGRAPHY

[Aczel, 2004] A. D. Aczel. Chance: A Guide to Gambling, Love, the Stock Market & Just About Everything Else. Thunder's Mouth Press, New York, 2004.
[Avigad, 2009] J. Avigad. The metamathematics of ergodic theory. Annals of Pure and Applied Logic, 157(2-3):64–76, 2009.
[Bailly and Longo, 2007] F. Bailly and G. Longo. Randomness and determinism in the interplay between the continuum and the discrete. Mathematical Structures in Computer Science, 17(2):289–305, 2007.
[Becher and Figueira, 2002] V. Becher and S. Figueira. An example of a computable absolutely normal number. Theoretical Computer Science, 270(1-2):947–958, 2002.
[Belshaw and Borwein, n.d.] A. Belshaw and P. Borwein. Strong Normality of Numbers. http://www.cecm.sfu.ca/personal/pborwein/PAPERS/P211.pdf. Contains material added and updated after Belshaw's master's thesis.
[Belshaw, 2005] A. Belshaw. On the normality of numbers. Master's thesis, Simon Fraser University, 2005.
[Beltrami, 1999] E. J. Beltrami. What Is Random? Chance and Order in Mathematics and Life. Copernicus (Springer-Verlag), New York, 1999.
[Bennett, 1979] C. H. Bennett. On Random and Hard-to-Describe Numbers. Technical Report RC-7483, IBM Watson Research Center, Yorktown Heights, New York, 1979. Reprinted in Randomness and Complexity, from Leibniz to Chaitin (C. Calude, ed.), World Scientific, 2007, pp. 3–12.
[Bennett, 1998] D. J. Bennett. Randomness. Harvard University Press, Cambridge, Mass., and London, England, 1998.
[Boolos and Jeffrey, 1989] G. S. Boolos and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 1989.
[Borel, 1909] E. Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rend. Circ. Mat. Palermo, 27:247–271, 1909.
[Calude, 1994] C. S. Calude. Information and Randomness: An Algorithmic Perspective. Springer-Verlag, 1994.
[Calude, 2000] C. S. Calude. Who is afraid of randomness? Technical Report CDMTCS-143, University of Auckland, New Zealand, 2000.
[Calude, 2005] C. S. Calude. Algorithmic randomness, quantum physics, and incompleteness. In Proceedings of the Conference on Machines, Computations and Universality (MCU 2004), volume 3354, pages 1–17. Springer, 2005.
[Chaitin, 1992] G. J. Chaitin. Algorithmic Information Theory. Cambridge University Press, 1992.
[Church, 1940] A. Church. On the concept of a random sequence. Bull. Amer. Math. Soc., 46:130–135, 1940.
[Davis, 1980] M. Davis. What is a computation? In Lynn Arthur Steen, editor, Mathematics Today, pages 241–267. Vintage Books, New York, 1980.
[de Laplace, 1819/1952] Pierre-Simon de Laplace. A Philosophical Essay on Probabilities. Dover, 1952. Translated from the 6th French edition of 1819.
[Delahaye, 1993] J.-P. Delahaye. Randomness, unpredictability and absence of order: the identification by the theory of recursivity of the mathematical notion of random sequence. In Philosophy of Probability, pages 145–167, 1993.
[Downey and Hirschfeldt, 2010] R. G. Downey and D. Hirschfeldt. Algorithmic Randomness and Complexity. Springer, 2010.
[Eagle, 2005] A. Eagle. Randomness is unpredictability. The British Journal for the Philosophy of Science, 56(4):749–790, 2005.
[Feller, 1968] W. Feller. An Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley & Sons, New York, 1968.
[Gács, 1974] P. Gács. On the symmetry of algorithmic information. Soviet Math. Dokl., 15:1477–1480, 1974.
[González, 2008] Cristóbal Rojas González. Randomness and Ergodic Theory: An Algorithmic Point of View. PhD thesis, École Polytechnique, Paris, France, and Università di Pisa, Italy, 2008.
[Hjorth and Nies, 2007] G. Hjorth and A. Nies. Randomness via effective descriptive set theory. Journal of the London Mathematical Society, 75(2):495, 2007.
[Hoyrup, 2008] Mathieu Hoyrup. Computability, Randomness and Ergodic Theory on Metric Spaces. PhD thesis, University Paris Diderot, France, and Università di Pisa, Italy, 2008.
[Jauch, 1990] J. M. Jauch. Are Quanta Real? A Galilean Dialogue. Indiana University Press, 1990.
[Knuth, 1998] D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, Mass., 3rd edition, 1998.
[Kolmogorov, 1963] A. N. Kolmogorov. On tables of random numbers. Sankhyā, Ser. A, 25:369–376, 1963.
[Kolmogorov, 1965] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems Inform. Transmission, 1(1):1–7, 1965.
[Kolmogorov, 1974] A. N. Kolmogorov. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems Inform. Transmission, 10:206–210, 1974.
[Kuipers and Niederreiter, 1974] L. Kuipers and H. Niederreiter. Uniform Distribution of Sequences. John Wiley & Sons, New York, 1974.
[Li and Vitanyi, 2008] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008.
[Longo, 2009] G. Longo. Randomness and determination, from physics and computing towards biology. In SOFSEM 2009: Theory and Practice of Computer Science: 35th Conference on Current Trends in Theory and Practice of Computer Science, Špindlerův Mlýn, Czech Republic, January 24-30, 2009, Proceedings, page 49. Springer, 2009.
[MacHale, 1993] D. MacHale. Comic Sections: The Book of Mathematical Jokes, Humour, Wit, and Wisdom. Boole Press, Dublin, 1993.
[Martin-Löf, 1966] P. Martin-Löf. The definition of random sequences. Inform. and Control, 9:602–619, 1966.
[Merkle et al., 2006] W. Merkle, J. S. Miller, A. Nies, J. Reimann, and F. Stephan. Kolmogorov-Loveland randomness and stochasticity. Annals of Pure and Applied Logic, 138(1-3):183–210, 2006.
[Miller and Nies, 2006] J. S. Miller and A. Nies. Randomness and computability: open questions. Bulletin of Symbolic Logic, pages 390–410, 2006.
[Mlodinow, 2008] L. Mlodinow. The Drunkard's Walk: How Randomness Rules Our Lives. Pantheon Books, New York, 2008.
[Moschovakis, 1980] Y. N. Moschovakis. Descriptive Set Theory, volume 100 of Studies in Logic and the Foundations of Mathematics. North-Holland Publishing Company, Amsterdam, 1980.
[Nies, 2009] A. Nies. Computability and Randomness. Oxford University Press, 2009.
[Odifreddi, 1992] P. Odifreddi. Classical Recursion Theory, volume 125 of Studies in Logic and the Foundations of Mathematics. North-Holland (Elsevier), Amsterdam, 1992.
[Penrose, 1989] R. Penrose. The Emperor's New Mind. Oxford University Press, New York, 1989.
[Rogers Jr, 1987] H. Rogers, Jr. Theory of Recursive Functions and Effective Computability. MIT Press, Cambridge, Mass., 1987.
[Shen et al., 20??] A. K. Shen, V. A. Uspensky, and N. K. Vereshchagin. Kolmogorov Complexity and Randomness. To appear, 20??.
[Sierpinski, 1917] W. Sierpinski. Démonstration élémentaire du théorème de M. Borel sur les nombres absolument normaux et détermination effective d'un tel nombre. Bull. Soc. Math. France, 45:127–132, 1917.
[Sipser, 1997] M. Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1997.
[Solomonoff, 1960] R. J. Solomonoff. A preliminary report on a general theory of inductive inference. Technical Report ZTB-138, Zator Company, Cambridge, Mass., November 1960.
[Solomonoff, 1964] R. J. Solomonoff. A formal theory of inductive inference, parts 1 and 2. Inform. Contr., 7:1–22 and 224–254, 1964.
[Stewart, 2002] I. Stewart. Does God Play Dice? The New Mathematics of Chaos. Blackwell Publishers, 2002.
[Svozil, 1993] K. Svozil. Randomness & Undecidability in Physics. World Scientific, 1993.
[Taleb, 2005] N. N. Taleb. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. Random House Trade Paperbacks, New York, 2005.
[Tromp, 2009] J. Tromp. Binary Lambda Calculus and Combinatory Logic. http://homepages.cwi.nl/~tromp/cl/cl.html, 2009. PostScript file of paper updated on March 13, 2009.
[Turing, 1992] A. M. Turing. A note on normal numbers. In Collected Works of A. M. Turing: Pure Mathematics, pages 117–119. North-Holland, Amsterdam, 1992.
[Uspensky, 1983] V. A. Uspensky. Post's Machine. Little Mathematics Library. Mir Publishers, Moscow, 1983.
[van der Waerden, 1927] B. L. van der Waerden. Beweis einer Baudetschen Vermutung. Nieuw Arch. Wisk., 15:212–216, 1927.
[van Lambalgen, 1987a] M. van Lambalgen. Random Sequences. PhD thesis, University of Amsterdam, 1987.
[van Lambalgen, 1987b] M. van Lambalgen. Von Mises' definition of random sequences reconsidered. Journal of Symbolic Logic, 52:725–755, 1987.
[van Lambalgen, 1989] M. van Lambalgen. Algorithmic information theory. Journal of Symbolic Logic, pages 1389–1400, 1989.
[van Lambalgen, 1990] M. van Lambalgen. The axiomatization of randomness. Journal of Symbolic Logic, pages 1143–1167, 1990.
[van Lambalgen, 1996] M. van Lambalgen. Von Mises' axiomatization of randomness reconsidered. In T. S. Ferguson, L. S. Shapley, and J. B. MacQueen, editors, Statistics, Probability and Game Theory: Papers in Honor of David Blackwell, volume 30 of IMS Lecture Notes and Monograph Series. Hayward, CA, 1996.
[Ville, 1939] J. Ville. Étude Critique du Concept de Collectif. Gauthier-Villars, Paris, 1939.
[Volchan, 2002] Sergio B. Volchan. What is a random sequence? American Mathematical Monthly, 109:46–63, 2002.
[Von Mises, 1919] R. von Mises. Grundlagen der Wahrscheinlichkeitsrechnung. Math. Zeitschrift, 5:52–99, 1919.
[Von Mises, 1981] R. von Mises. Probability, Statistics and Truth (1957). Dover Publications, New York, 1981.
[V'yugin, 1997] Vladimir V. V'yugin. Effective convergence in probability and an ergodic theorem for individual random sequences. SIAM Theory of Probability and Its Applications, 42(1):39–50, 1997.
[V'yugin, 1999] Vladimir V. V'yugin. Ergodic theorems for individual random sequences. Theoretical Computer Science, 207(4):343–361, 1999.
[Wald, 1936] A. Wald. Sur la notion de collectif dans le calcul des probabilités. C. R. Acad. Sci., 202:1080–1083, 1936.
[Yurtsever, 2000] U. Yurtsever. Quantum mechanics and algorithmic randomness. arXiv preprint quant-ph/9806059v2, pages 1–8, 2000.
Part VII
Probabilistic and Statistical Paradoxes
PARADOXES OF PROBABILITY

Susan Vineberg
1 INTRODUCTION
Numerous puzzles have arisen from the application of probability. These range from simple cases that are easily resolved to others that have generated considerable philosophical discussion and remain topics of lively debate. Quite a few classic puzzles arise from basic misapplications of the standard axioms of probability. While the mistakes involved are common, and are indicative of how prone humans are to fallacious probabilistic reasoning, the resolution of such problems is instructive but for the most part non-controversial. The solution to some of these involves a bit more than elementary probability theory, but the principles needed are fairly unproblematic. Other cases involve more substantive principles of probability and/or decision. In many instances there are generally accepted resolutions, although the principles involved are more open to question than the basic axioms of probability. As such, these, along with the very simple problems involving clear mistakes of reasoning, do not seem very paradoxical. However, other problems lack a single clear solution and involve a number of hotly contested issues.

It will be useful in introducing the subject of probabilistic paradoxes to begin by examining a family of puzzles known as Bertrand's paradoxes, as well as some variations that have been proposed by others. These vary in terms of the principles that they may be said to depend upon, and accordingly vary considerably in their ease of resolution. After examining these, the concept of paradox will be considered, which will give rise to a way of categorizing the so-called paradoxes of probability. A wide variety of paradoxes will then be considered by type.
1.1 Bertrand's Paradoxes
Each of Bertrand’s paradoxes turns on alternative ways of counting up possibilities. This links these paradoxes with the notorious principle of indifference that forms the basis for the classical interpretation of probability. The principle of indifference requires that Events for which we have no reason to expect (favor) one over the other are to be taken as equipossible. Handbook of the Philosophy of Science. Volume 7: Philosophy of Statistics. Volume editors: Prasanta S. Bandyopadhyay and Malcolm R. Forster. General Editors: Dov M. Gabbay, Paul Thagard and John Woods. c 2011 Elsevier B.V. All rights reserved.
The classical interpretation of probability then dictates that the probability of an event is the ratio of favorable cases among equipossible ones, as given by the principle of indifference. The alternative ways of counting up possibilities lead to inconsistent assignments of probability, which has generally been taken as posing a problem for the classical interpretation. However, Bertrand's paradoxes vary in the pressure that they put on the principle of indifference. Some of these indicate the need for mild reform, whereas others suggest that the principle, and hence the classical interpretation of probability, may be simply unworkable.

The first case is easily resolved, and does not put serious pressure on the principle of indifference. It involves the following setup involving a box with three drawers, and accordingly is called Bertrand's box paradox: one drawer contains two gold coins, one contains two silver coins, and one contains one gold and one silver coin. A drawer is chosen at random, and a silver coin is drawn from that drawer. What is the probability that it came from the drawer with two silver coins?1

There are two ways to count alternatives in this problem, and accordingly we obtain two answers:

Answer 1: Given that a silver coin was drawn, it came from either the drawer with one silver and one gold coin, or the drawer with two silver coins. We have no reason to prefer one over the other, so the probability that the coin came from the drawer with two silver coins is 1/2.

Answer 2: The silver coin drawn was either the one silver coin in the drawer with one silver and one gold coin, or it was the first silver coin in the drawer with two silver coins, or it was the second silver coin in that drawer. There is no reason to favor any of these possibilities over the others, but two are favorable, so the probability is 2/3.

Here we have two different answers to the question that derive from the principle of indifference. Each answer has some plausibility, as evidenced by the fact that when the puzzle is posed to students and others, it is common to hear each answer defended in the ways given above. The conflicting answers suggest a problem, and the principle of indifference may be thought to be the culprit. Where the principle is read as requiring equal probability for events where there is no recognized reason for favoring one possibility over another, it leads to both answers. However, while those who propose Answer 1 may fail to recognize a reason to favor the drawer with one silver coin over that with two, there is in fact evidence for the latter. This is a consequence of some basic assumptions of the problem.

1 There are various equivalent formulations of Bertrand's box problem. A common version (the three coins) involves a hat containing three quarters. One is an ordinary quarter, one is two-headed, and the third has two tails. A quarter falls out, heads up. What is the probability that it has two heads? Yet another version (the three-card game) involves three cards. One is black on both sides, one is white on both sides, and one is black on one side and white on the other. Given that one draws a card that is black on one side, what is the probability that the other side is black?
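Before turning to the analysis, the two answers can be arbitrated empirically. The following Monte Carlo sketch (our illustration, not part of the original discussion) simulates the drawer-and-coin draw under the problem's stated assumptions and tallies, among trials in which a silver coin is drawn, how often it came from the two-silver drawer.

import random

random.seed(1)
drawers = [("G", "G"), ("S", "S"), ("S", "G")]

silver_draws = from_two_silver = 0
for _ in range(200_000):
    drawer = random.choice(drawers)    # each drawer equally likely
    coin = random.choice(drawer)       # each coin in it equally likely
    if coin == "S":
        silver_draws += 1
        if drawer == ("S", "S"):
            from_two_silver += 1

print(from_two_silver / silver_draws)  # ~ 2/3, vindicating Answer 2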
Let S be the claim that a silver coin is drawn. Let D2S be the claim that the drawer with two silver coins is chosen. Let D1S be the claim that the drawer with one silver coin and one gold coin is chosen. Bayes' Theorem yields:

    P(D2S/S) = P(S/D2S) × P(D2S)/P(S) = 1 × (1/3)/(1/2) = 2/3
    P(D1S/S) = P(S/D1S) × P(D1S)/P(S) = (1/2) × (1/3)/(1/2) = 1/3

Drawing a silver coin is thus evidence that the drawer with two silver coins was selected, relative to the basic assumptions of the problem, namely that each drawer initially has an equal chance of being selected and each coin in the chosen drawer has an equal chance of being selected. Thus, the reasoning in Answer 1, which proceeds from taking the alternatives to be that one drew from either the drawer with one silver coin or from the drawer with two silver coins to the conclusion that each is equally likely, is mistaken. A simple application of Bayes' theorem, which is a consequence of the basic axioms of probability, together with the idea that E is evidence for H iff P(H/E) > P(H), resolves the conflict between the two answers by revealing the faulty reasoning in Answer 1.

This example involves receiving evidence when a silver coin is drawn, and thus some principle of evidential relevance is required. Of course, the idea that E is evidence for H iff P(H/E) > P(H) is a cornerstone of Bayesianism, and is applied to what may be considered prior probabilities. However, there is no controversy in this case about what these are, nor any reason to balk here at the criterion of evidence, and so we have a satisfactory resolution of the problem.

There remains the issue that the faulty answer seems to stem from an application of the principle of indifference. However, rather than repudiating the principle, Bertrand's box shows that it must be read as requiring equal probability not when one recognizes no difference between two options, but when there is no available evidence that distinguishes between them. While those who accept Answer 1 might recognize no difference between the alternatives that one has drawn from the drawer with one gold and one silver coin, and the drawer with two silver coins, the basic assumptions of the problem show via Bayes' theorem that drawing a silver coin should be taken as evidence of having drawn from the drawer with two silver coins.

The resolution for Bertrand's box problem does not work in others of the so-called Bertrand cases, where the principle of indifference seems to lead to conflicting answers but in which no evidence is received. Consider drawing balls from an urn whose contents are unknown, and the probability of drawing a red ball.2 The alternatives can be taken to be that one draws a red ball or that one draws a ball that is not red. It seems that there is no evidence to favor one over the other; thus the principle of indifference, as clarified above, gives a probability for drawing a red ball of one half. But, if we apply the same reasoning to drawing a blue ball and then to drawing a green ball, we get Pr(red) = Pr(blue) = Pr(green) = 1/2, in violation of the probability axioms, which require that the probabilities of all the possibilities together sum to one.

It may be thought, however, that there are more ways in which a ball can fail to be red, i.e. it can be blue, or green, or yellow, etc., than ways in which it can be red.3 This provides a reason not to treat drawing a red ball as equipossible with all such other possibilities together, although it would be a stretch to suggest that this background provides evidence for thinking it is more likely that the ball is not red than that it is. Rather, this case is one in which we lack evidence altogether. Instead of appealing to some objective evidence to distinguish between possibilities, the suggestion has been made that the conflicting answers delivered by the principle of indifference can be avoided by insisting that consideration of the possibilities fully respect the symmetries in the problem. When the possibility of drawing a red ball is compared just with that of drawing a non-red ball, the symmetry with drawing some other colored ball is ignored. Here invoking symmetry considerations can be used to block the applications of the principle of indifference leading to incoherence. However, unlike in the case of Bertrand's box, we are not left with a definitive probability for drawing a red ball. Here it may be tempting to suppose, as subjective Bayesianism would have it, that one may rationally adopt any one of a variety of priors for the probability of selecting a red ball. But in the face of little real basis for assigning any particular value, perhaps one should not assign any definite probability or even a non-trivial range of probabilities. Without some assumed background that delimits the possibilities, the question itself seems ill posed.

The incoherence that arises from applying the principle of indifference first to the possibility of drawing a red ball, and then to drawing blue and green balls, can be seen as stemming from an increasing set of possibilities. Fixing a probability for drawing a red ball requires specifying a set of alternatives, which may just be that the ball is non-red, but once the space of alternatives is divided into more specific possibilities, we are asking for the probability of drawing a red ball in a different scenario. When the background conditions delimit the alternatives to drawing a blue or green ball, symmetry requires that the alternative to a red ball is not that it is non-red, but that it may be green or blue. With the alternatives fixed in this way, the principle of indifference does not lead to incoherence, and it seems reasonable enough to hold in accordance with it that Pr(red) = Pr(blue) = Pr(green) = 1/3. While the paradox proper dissolves, we may still wonder about the legitimacy of supposing that the colors are evenly distributed.

There are other examples in which symmetry considerations can be invoked to avoid the conflicting answers of the unqualified principle of indifference, two of which will be noted here. One is van Fraassen's cube factory, in which a factory produces cubes with side length between 0 and 1 foot.4 The first question asks for the probability that a cube chosen at random has a side length between 0 and 1/2 foot. Following the principle of indifference, the answer appears to be 1/2.

2 This example is discussed in [Kyburg, 1970].
3 [Weatherford, 1982]
4 See [van Fraassen, 1989; Hájek, 2007].
If we ask instead for the probability that a randomly chosen cube has face area between 0 and 1/4 square foot, thinking of the cubes as constructed in a way that is uniformly distributed over face area, the answer would seem to be 1/4. But, if the question is posed in terms of the probability of selecting a randomly chosen cube with volume between 0 and 1/8 cubic foot, we get an answer of 1/8. Here applying the principle of indifference to the different descriptions yields different probabilities for one and the same state, since the cubes with side between 0 and 1/2 foot, those with face area between 0 and 1/4 square foot, and those with volume between 0 and 1/8 cubic foot are the same cubes.

Perhaps the most famous of Bertrand's problems concerns the probability, given a circle, that a randomly chosen chord is longer than the side of an inscribed equilateral triangle. There are three variables that seem uniformly distributed, which lead by the principle of indifference to three different answers:

Answer 1: By rotating any such triangle, it may be assumed that one end of the chord lies on a vertex of the triangle. Any chord whose other end lies between the other two vertices of the triangle will be longer than the side of the triangle. The endpoint of any such chord lies on an arc with 1/3 the circumference of the circle, so the probability would seem to be 1/3.

Answer 2: By rotating the triangle, it may be assumed that the chord is parallel to one side of the triangle. Next consider the radius of the circle that is perpendicular to the side of the triangle to which the chord is parallel. The midpoint of the chord will then lie on that radius. The chord is longer than the side of the triangle just in case it is closer to the center of the circle than is the side of the triangle. The side of the triangle must bisect the radius, and so the probability is 1/2.

Answer 3: Given a point within the circle, consider the chord that has the point as its midpoint. The chord will be longer than a side of the inscribed triangle provided that the point falls within an inner circle of radius 1/2 that of the original circle. The inner circle has area 1/4 that of the original circle, and so the probability is 1/4.

Poincaré and E. T. Jaynes both argued that geometric symmetries dictate the second answer [Jaynes, 1973]. Van Fraassen applies the idea to the cube problem, showing that under measures that are dilation invariant [that is, measures m such that for positive k, m(a, b) = m(ka, kb)], the principle of indifference yields the same probability whether we consider cube length or cube area. However, as Jaynes and van Fraassen note, there remain Bertrand-like cases that are apparently immune to such maneuvers. In an example due to von Mises, we consider a glass with 10 oz of wine and water. Suppose at least 1 oz is water and 1 oz is wine. What is the probability that the glass contains at least 5 oz of water? If we consider the proportion of water to the total and then the proportion of wine to water, we find, as in the case of the cubes, that a uniform distribution on the former leads to a non-uniform distribution on the latter. However, in this case, there appears to be nothing that fixes an invariant measure that will yield a unique answer.
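The three answers to the chord problem can be reproduced numerically. The sketch below (ours, for illustration) samples chords of a unit circle by the three methods just described and compares each chord's length with √3, the side length of the inscribed equilateral triangle.

import math
import random

random.seed(2)
SIDE = math.sqrt(3)   # side of the equilateral triangle inscribed in a unit circle
N = 200_000

def chord_length(theta1, theta2):
    # length of the chord between two points on the unit circle
    return 2 * math.sin(abs(theta1 - theta2) / 2)

# Method 1: two endpoints chosen uniformly on the circumference
m1 = sum(chord_length(random.uniform(0, 2 * math.pi),
                      random.uniform(0, 2 * math.pi)) > SIDE
         for _ in range(N)) / N

# Method 2: distance of the chord from the center chosen uniformly in [0, 1];
# a chord at distance d has length 2 * sqrt(1 - d^2)
m2 = sum(2 * math.sqrt(1 - random.uniform(0, 1) ** 2) > SIDE
         for _ in range(N)) / N

# Method 3: midpoint chosen uniformly in the disk (by rejection sampling)
def random_midpoint_distance():
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return math.hypot(x, y)

m3 = sum(2 * math.sqrt(1 - random_midpoint_distance() ** 2) > SIDE
         for _ in range(N)) / N

print(m1, m2, m3)   # ~ 1/3, ~ 1/2, ~ 1/4 respectively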
2 PUZZLES AND PARADOXES
We have seen that the puzzle created by Bertrand's box is easily resolved. There are ways of avoiding inconsistent answers using the principle of indifference in the case of the colored balls, the cube factory, and Bertrand's chord paradox, although these solutions do still leave some questions. However, the water and wine case remains a problem for the principle of indifference. An option that is often suggested is that the principle of indifference should simply be abandoned. Still, the principle retains some appeal, even though it is not quite clear how exactly it could be reformed so as to avoid all of the difficulties that its application generates. It is the seeming plausibility of the principle that makes these cases seem paradoxical. But the fact that the problem in each case may be resisted by limiting or giving up the principle of indifference suggests that these are not to be regarded as cases of genuine paradoxes.5 Settling this requires considering the question of what constitutes a paradox.

Sainsbury [1988] characterizes a paradox as an apparently unacceptable conclusion derived by apparently acceptable reasoning from apparently acceptable premises. In the case of a paradox, appearances deceive, as either the premises, the reasoning, or the claim that the conclusion is unacceptable must be defective. Often paradoxes are characterized as involving a contradiction reasoned to from intuitively plausible premises. This is essentially just a variation on Sainsbury's characterization, because where the conclusion is unacceptable, its denial can be offered as a plausible premise, yielding a contradiction. Cases such as Bertrand's box, in which two or more conflicting answers are offered, can be seen as fitting this pattern insofar as we are willing to take the assumptions of each as premises of a single argument. However, in many instances, this one included, those who are tempted by one answer typically reject the others and their premises, so that these cases cannot be assimilated to Sainsbury's definition. Instead, we would do better to identify a subclass of paradoxes in which there are two or more conflicting answers to a problem, each involving seemingly acceptable reasoning from plausible premises, where it is at least initially unclear which premises and/or which reasoning should be discarded. This subclass can be covered by maintaining that a paradox leads to contradictory conclusions from assumptions that have a claim to reasonableness by way of seemingly correct reasoning.

5 Adding to this are doubts that we should generally expect uniform distributions in cases where we have no reason to suppose that a random process is at work in producing that distribution. This concern applies in the case of drawing balls out of an urn where the proportion of red balls is unclear. Thus, there are two distinct reasons to think that the principle of indifference is flawed. The fact that it can be applied so as to lead to multiple answers is clearly a defect; it must be discarded if it cannot be delimited in all cases so as to avoid this. Beyond this, it seems that it is not compulsory in cases where there is no reason to assume that the distribution is the result of a random process.
Some will see this characterization as too broad, as it counts as paradoxes cases in which the apparent unacceptability of the premises, conclusions, and reasoning is easily resolved, as in the case of Bertrand's box. Rather, it might be held that a genuine paradox is one in which an unacceptable conclusion is derived from assumptions that are not just initially plausible, but so intuitive or seemingly well supported that they are strongly resistant to revision. Examples of non-probabilistic paradoxes that have this character include at least some versions of the liar and sorites paradoxes. Whether any of the probability puzzles that have been discussed in the literature are genuine paradoxes in this sense remains an open question. While there are lots of puzzles that have been discussed that are quite easily resolved, such as Bertrand's box paradox, and some others that are more complex but nonetheless seem rightly classified as less than fully paradoxical, including each of the other Bertrand paradoxes, there are a number of puzzles about which substantial disagreement remains, and which thus might rise to the level of full paradox. However, it may be that the disagreements arise simply because various aspects of probability remain poorly understood, and that even those puzzles that have spawned considerable research in recent years will eventually lose their paradoxical status in a way that the liar and sorites paradoxes have continued to resist.

Despite the fact that the above definition will count as paradoxical those puzzles of probability that are easily resolved, and thus perhaps unworthy of the designation, it is nevertheless worthwhile to adopt it here. On this definition, a paradox of probability can be understood as leading to contradictory claims from assumptions concerning probability that have a claim to reasonableness by way of seemingly correct reasoning, where the reasoning may itself be probabilistic. It is a virtue that this counts as paradoxical those puzzles involving probability that are often referred to as paradoxes of probability, including those that are in fact rather easily resolved. Nevertheless, even many such simple problems of probability have the capacity to perplex, and so it seems acceptable to count them as paradoxes. More importantly, this characterization gives rise to a natural way of categorizing the probabilistic paradoxes in terms of the content of the premises and the type of probabilistic reasoning involved in generating the paradox.

A number of paradoxes are easily resolved by consideration of simple deductions from the basic axioms of probability, sometimes in conjunction with Bayes' rule. We have already seen one example of this in Bertrand's box, and others will be noted below. Of course, people frequently violate the rules of probability in ordinary reasoning, as Tversky and Kahneman [1974] have shown, but this does not suggest that the probability axioms fail to capture correct probabilistic reasoning. However, the fact that such reasoning violating the probability axioms seems correct to many supports the broad definition of paradox that would include those puzzles that arise from simple errors of probabilistic reasoning.

A second class of paradoxes involves more substantive probabilistic assumptions
and reasoning. These paradoxes cover those involving infinite spaces, partitioning issues, and principles of evidence. Among these are the doomsday paradox, Simpson's paradox, and the Sleeping Beauty problem. The Bertrand paradoxes that do not turn on a failure to account for the evidence, in which symmetry issues arise, belong in this category. A third class of paradoxes involves not only probabilistic assumptions and/or reasoning, but also the assignment of utilities and the computation of expected value. These include the St. Petersburg and related paradoxes, the two-envelope problem, and Newcomb's problem. In each of these cases, it is the computation of expected value that is at least sometimes fingered as the source of the problem.
3 SIMPLE PUZZLES OF PROBABILITY

3.1 The Monty Hall Problem
One of the most famous puzzles involving probability derives from the classic game show in which the host, Monty Hall, presents a contestant with three doors and announces that there is a prize behind only one of the doors. The contestant chooses a door, say door 1, after which Monty offers to show what is behind one of the doors that the contestant did not select. Monty then reveals that there is no prize behind one of the other doors, say door 2, and offers the contestant the chance to switch to door 3. The question that is usually posed is whether the contestant should switch, but we may also ask what the contestant's probability for winning should be after Monty reveals that there is no prize behind door 2.6

It is assumed here that there is an equal chance that the prize was placed behind each of the three doors. This can be taken to follow from the principle of indifference or else can be built into the problem, so as to avoid any reliance on the principle. It is to be assumed that Monty knows where the prize is and that he will always act on this information to show a door that does not contain a prize. The contestant is taken to be aware of this.

The first reaction that many have is that the probability that the prize is behind door 1 should remain the same as the probability that it is behind door 3. Accordingly, many people think that the probability is 1/2 for each of the two remaining doors, and so there is no point in switching. It is reasoned that since you know that Monty will reveal what is behind one of the doors not selected, learning that there is no prize behind door 2 (as opposed to learning this about door 3) provides no relevant information about whether the prize is behind door 1. But this is mistaken.

Given the assumptions above, the probability that the prize is behind door 1 (D1) is indeterminate after Monty shows that there is no prize behind door 2. However, if it is assumed that if D1 is true, Monty uses, say, the toss of a fair coin to determine whether he reveals what is behind door 2 or door 3, then Bayes' rule can be applied to calculate the posterior probability for D1.

6 A structurally identical problem involves three prisoners, in which two are to be executed and the other set free, and the warden gives one of the prisoners the name of another prisoner who is to be executed.
Let D1, D2, D3 be the propositions that the prize is behind doors 1, 2, and 3 respectively, and let S2, S3 be the propositions that Monty reveals the contents behind doors 2 and 3. Let Po be the probability before Monty reveals the contents of one of the doors and P1 the probability afterwards.

    Po(D1) = Po(D2) = Po(D3) = 1/3
    P1(D1) = Po(D1/S2) = Po(D1 & S2)/Po(S2) = (1/6)/(1/2) = 1/3
    P1(D3) = Po(D3/S2) = Po(D3 & S2)/Po(S2) = (1/3)/(1/2) = 2/3
With the assumption that Monty chooses randomly between doors 2 and 3 when he is unconstrained in his choice of those two by the location of the prize, it follows that being shown that there is no prize behind door 2 is positively relevant to the prize being behind door 3. This establishes that it is wrong to think that, because it is known that Monty will reveal a door, his revealing a particular door cannot provide evidence as to which door the prize is behind.

However, if Monty does not choose randomly in the case where he is unconstrained by the location of the prize, the evidential relevance of his revealing no prize behind door 2 can differ. For example, suppose it is known that Monty will always reveal door 2 except where he is constrained not to because the prize is behind door 2. In this case the posteriors are as follows:

    P1(D1) = Po(D1/S2) = Po(D1 & S2)/Po(S2) = (1/3)/(2/3) = 1/2
    P1(D3) = Po(D3/S2) = Po(D3 & S2)/Po(S2) = (1/3)/(2/3) = 1/2
Since Monty will show door 2 here except where he is constrained not to, doing so does not offer evidence to distinguish between D1 and D3. This scenario establishes an upper bound on the probability that may reasonably be attributed to D1 after Monty reveals one of the other doors. Of course, Monty could adopt a rule of never showing door 2 unless he was constrained to do so. Assuming that the contestant knows this, then it would be known that the prize is behind 3 once 2 is revealed. It might be specified that this possibility is to be excluded by the rules of the game, but short of a procedure that would make the location of the prize certain once one of the doors not chosen is revealed, Monty could still adopt a rule that would make it virtually certain that the prize is behind door 3 in the event that he reveals 2. In any case, where Monty’s procedure for choosing a door to reveal is unspecified, 0 < P1 (D1) ≤ 1/2. Since the value of P1 (D1) never exceeds a half and may be less, even where Monty’s rule is unknown, the contestant should always switch, unless he knows that Monty will always reveal door 2 except when constrained not to, and even in that case there is no harm in switching. The Monty Hall problem has been much discussed, in part because of the strong initial temptation to think that Monty does not provide evidence that distinguishes
between the two remaining doors when he opens one of the doors not chosen. There have even been quite a few mathematicians and statisticians who have insisted that the probability of the prize being behind the contestant’s chosen door moves from 1/3 to 1/2. Despite this, the puzzle belongs in the category of those that are easily resolved using the probability axioms and the concept of evidential relevance. As in the structurally similar case of Bertrand’s box, an application of Bayes’ rule, using only the basic assumptions of the problem, yields a definitive answer.
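The two scenarios analyzed above are easy to check by simulation. In the following sketch (ours, not part of the original text), a "random" Monty flips a coin when the prize is behind door 1 and he may open either remaining door, while a "door-2" Monty opens door 2 whenever the prize allows; in both cases we condition on his having opened door 2.

import random

random.seed(3)

def trial(policy):
    prize = random.randint(1, 3)
    # The contestant picks door 1; Monty opens door 2 or 3, never the prize.
    if prize == 1:
        opened = random.choice([2, 3]) if policy == "random" else 2
    else:
        opened = 2 if prize == 3 else 3
    return prize, opened

for policy in ("random", "door-2"):
    shown2 = win_by_staying = 0
    for _ in range(300_000):
        prize, opened = trial(policy)
        if opened == 2:
            shown2 += 1
            win_by_staying += (prize == 1)
    print(policy, win_by_staying / shown2)   # ~ 1/3 and ~ 1/2 respectively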
3.2 The Birthday Paradox
A problem with an answer that is surprising to many involves the chance that at least two people in a group have the same birthday. Given a group of more than 22 people, there is a greater than 50-50 chance that two have the same birthday. In a group of 10, there is a greater than 10% chance of this. These facts can be shown via a straightforward calculation involving the possible combinations. Let n be the number of people in the group. Ignoring the possibility of a leap-year birthday, the probability is given by:

    1 − 365!/(365^n (365 − n)!)

Although many imagine that the chance is actually much less than 50-50 of two people in a group of 23 having the same birthday, the answer itself is entirely uncontroversial. The complexity of the formula for determining the probability suggests that people are not relying on anything like the formula in estimating the probability, but rather on a much simpler conceptualization, which in effect discounts many of the possible combinations. Perhaps the reasoning proceeds something like this: It is highly improbable that any two people chosen at random have the same birthday, so it is imagined that any two such people have different birthdays. Of course, assuming that those two people have different birthdays, a third person chosen at random is highly unlikely to share either of their birthdays. Continuing on in this way, the possibility that two in 23 have the same birthday still looks unlikely. But such reasoning involves discounting both the very small probability that two people chosen at random have the same birthday, and the cumulative effect of such small probabilities as the number of people increases.

This suggestion as to the source of the error in the birthday paradox is somewhat similar to one of the assumptions that generates the lottery and preface paradoxes. In the lottery paradox, it is assumed that a ticket is purchased from a large number of tickets, one of which is assured of winning. Given the large number of tickets, it seems reasonable to accept of any given ticket that it will not win. From this it appears to follow that none of the tickets will win, contradicting the assumption that one ticket will win.7 This paradox has generated considerable discussion. Many would resolve it by questioning the assumption that any given ticket will not win, arguing that each should only be taken to have a very low probability of winning, and that the assumption that a given ticket will not win does not follow from this. It is something like this rejected assumption that appears to be at work in the birthday paradox.

The lottery paradox is nevertheless quite different from the birthday paradox, particularly in that there is no general consensus as to its resolution. Rather than rejecting the assumption that any given ticket will not win, some would block the paradox by rejecting the idea that it follows from the acceptance of the claim that, say, ticket #1 will lose and acceptance of the claim that ticket #2 will lose, that it should be accepted that both will lose. Thus the lottery paradox does not belong in the category of the easily resolved. Perhaps more importantly, it differs from other paradoxes involving probability in that it is not principles of probability or probabilistic relevance that are at stake, but rather the relationship between degrees of belief, as measured by probabilities, and acceptance.

7 The preface paradox is just a variation of the lottery paradox that stems from the practice of taking responsibility for the errors in one's book in the preface. The paradox arises because the author accepts each of the claims in the book, yet in acknowledging the likelihood that the book contains errors, appears committed to accepting that all the claims in the book are true, and also that they are not.
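The birthday figures quoted above are easily verified. The following sketch (ours) evaluates 1 − 365!/(365^n (365 − n)!) via the equivalent running product, which avoids computing the huge factorials directly.

def birthday_collision_probability(n: int) -> float:
    # Probability that at least two of n people share a birthday
    # (365 days, leap years ignored), computed as 1 - prod (365-k)/365.
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

print(birthday_collision_probability(10))  # ~ 0.117 (greater than 10%)
print(birthday_collision_probability(23))  # ~ 0.507 (greater than 50-50)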
3.3 The Sibling Mystery
There are several questions involving the chances that a person's sibling is a girl that are easily confused. These are often collectively referred to as the sibling mystery or mysteries (or the boy-girl paradoxes). The first question asks, given a two-child family in which the older child is a boy, for the probability that his younger sibling is a girl. It is to be assumed that the chance of any given child being a girl is 50%; the facts that the ratio of boys to girls in the population is not exactly 50-50, that the sex of a given child is not entirely independent of that of his or her older siblings, and that some children are intersex, are ignored here. Given these assumptions, the answer to the first question is 1/2. The mystery comes with the second question, which asks for the probability that a boy from a two-child family has a sister. This looks like the first question, but the answer is different.

The background assumption of both problems is that a boy comes from a two-child family. The possible orderings of boys (B) and girls (G) in such families are

    BB, BG, GB, GG

By assumption, each is equally probable. The information that the boy is the older of two children rules out the last two possibilities. Since the first two remain equally likely, the probability that the younger sibling is a girl is 1/2. But learning only that a boy is from a two-child family rules out just the fourth case. Since in two of the three remaining cases the boy has a sister, the answer to the second question is 2/3. One reason this may seem puzzling at first is that if such a boy is an older sibling, the probability that he has a sister is 1/2, and if such a boy is a younger sibling, the probability that he has a sister is 1/2, but if he is either the younger or the older of two siblings, the probability that he has a sister is 2/3.
Things change again if it is assumed that one randomly encounters a boy who has one sibling. It may seem that the probability that he has a sister is the same as in the second example above, but this is incorrect. Given the information that the boy is from a two-child family, we know that he is either the older brother of his sister, the older brother of his brother, the younger brother of his sister, or the younger brother of his brother. Each case is equally probable, so the probability that he has a sister is 1/2. The key here is that in a random draw of a boy from two-child families, one is twice as likely to get a boy from a family of type BB than from type BG or GB.

Such cases are not truly paradoxical. They merely illustrate that care must be taken in specifying the group from which a random selection is made. In the first two cases the draw is over family types, and then some evidence as to which type we have is given. This is perhaps clearer in variants of the first two questions in which a woman is asked about her children. She says she has two, which makes it appropriate to consider random draws from the set {BB, BG, GB, GG}. This contrasts with the last case, in which the draw is from individuals rather than families. Sometimes the problem begins with a boy saying that he has one sibling, and then the probability that he has a sister is asked for. It is unsurprising when posed in this way that both answers are given, for it is unclear whether the problem should be treated as a random encounter of individuals from two-child families. It is also now not so perplexing that the probability that a boy's one younger sibling is a girl is 1/2, whereas the probability that a boy from a two-child family has a sister is 2/3. A random draw from {BB, BG, GB} is not the same as a draw from either {BB, BG} or a draw from {BB, GB}.
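All three answers can be confirmed by sampling. The sketch below (our illustration) draws two-child families uniformly and conditions in the three ways discussed: the older child is a boy; the family contains at least one boy; and a boy is encountered at random among all boys from two-child families.

import random

random.seed(4)
families = [random.choice("BG") + random.choice("BG") for _ in range(400_000)]

# Q1: the older child is a boy -- probability the younger is a girl
q1_pool = [f for f in families if f[0] == "B"]
print(sum(f[1] == "G" for f in q1_pool) / len(q1_pool))      # ~ 1/2

# Q2: the family has at least one boy -- probability he has a sister
q2_pool = [f for f in families if "B" in f]
print(sum("G" in f for f in q2_pool) / len(q2_pool))         # ~ 2/3

# Q3: a boy drawn at random from all boys in two-child families
boys = [(f, i) for f in families for i in (0, 1) if f[i] == "B"]
print(sum(f[1 - i] == "G" for f, i in boys) / len(boys))     # ~ 1/2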
4 MORE COMPLEX PROBLEMS OF PROBABILITY

4.1 Simpson's Paradox
A perplexing fact about the association of variables in a population, termed Simpson's paradox, concerns the possibility of reversing the association within subpopulations. In a much discussed example involving applicants to graduate school at the University of California, Berkeley, it was observed that women had a higher rejection rate than men overall, while having a lower rejection rate department by department.8 Although the overall higher rejection rate suggests the possibility of gender discrimination, this is countered by the fact that admission is determined by evaluation at the departmental level, and in each department the rejection rate for women was actually equal to or lower than that for men. There is a simple explanation, supported by the data in this case, that accounts for the overall higher rejection rate despite the lower rate department by department: namely, women tended to apply to departments with higher rejection rates.

The interesting fact is that given an association, it is always possible to reverse the correlation in some sub-population. This is a consequence of the fact that for whole numbers it is possible to have:

    a/b < a′/b′ and c/d < c′/d′, but (a + c)/(b + d) > (a′ + c′)/(b′ + d′)

For discussion see [Malinas and Bigelow, 2004]. While this is perhaps surprising, once it is understood that this simple mathematical fact lies at the heart of the reversed associations, it is clear that these are not truly paradoxical. However, the possibility of reversing an association of variables by repartitioning raises questions about interpreting correlations. Cartwright [1983] called attention to the paradox, arguing that it indicates that one cannot simply use such correlations to represent causal relations, and that some basic causal relationships are needed to interpret the significance of correlations.

8 See [Bickel et al., 1977].
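A minimal numeric instance of the whole-number fact above (our example, not the Berkeley data) makes the reversal concrete.

from fractions import Fraction as F

# In each subgroup the primed ratios are larger, yet pooled they are smaller.
a, b, ap, bp = 1, 5, 2, 8      # 1/5 < 2/8
c, d, cp, dp = 6, 8, 4, 5      # 6/8 < 4/5

assert F(a, b) < F(ap, bp)
assert F(c, d) < F(cp, dp)
assert F(a + c, b + d) > F(ap + cp, bp + dp)   # 7/13 > 6/13: the reversal

print(F(a + c, b + d), ">", F(ap + cp, bp + dp))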
4.2 The Doomsday and Shooting Room Paradoxes
The shooting room paradox [Leslie, 1996] involves a room in which each person or group of people entering encounters an executioner who rolls a pair of dice. If he rolls double sixes, everyone in the group is killed. If not, they are free to go. It is also the case that at least 9/10 of those entering the room die, which is made possible by the fact that people enter the room in rounds, in increasing numbers, until double sixes are rolled. With probability one, this will occur eventually, and at that time at least ninety percent of those who have entered in one of the rounds will be in the room during the final, fatal round. If the executioner rolls a double six on the first round, 100% of those entering the room die, but otherwise the number entering the room increases so that at least 90% of those entering the room are killed. The paradox arises because, assuming that an individual enters the room after the first round, it appears that there are two distinct probabilities that may be attached to the proposition that he will die, namely 1/36, which reflects the chance of double sixes, and .9, which reflects the percentage of people entering the room who die. However, given the increasing numbers entering the room, the two numbers are compatible. Since it is the roll of a double six that determines whether a given individual in the room dies, the chance for an individual of dying is 1/36.

A related paradox draws the conclusion that doomsday will likely come soon from the premise that I am now alive. Leslie [1990] presents the argument roughly as follows: If doom is far in the future, then I have very early birth order among all humans, which is unlikely. However, if doom is sooner, given that a fairly high percentage of the humans that have ever lived are living now, it follows that I am a more typical human. Thus, the evidence that I am alive now favors doom sooner over doom later.

The shooting room paradox, in which the largest group of people to enter the room occurs in the final round, is really a kind of special case of the Doomsday paradox. However, the more general doomsday paradox lacks the clear mechanism that allows for a definitive resolution. In a discussion of Leslie's version of the doomsday argument, Sober [2003] takes it as an argument about likelihoods. The idea is that an observation O favors H1 over H2 just in case O is more probable given H1 than given H2, i.e. P(O/H1) > P(O/H2). Assuming H1 and H2 are the hypotheses of doom sooner and later respectively, the claim is that my being alive now favors early doom. However, Sober objects to Leslie's assumption that P(O/H1) > P(O/H2) because it is based on a suspect, a priori, argument that involves treating my place in the temporal birth order as a lottery. Sober thinks this is inappropriate because there is no underlying chance process that can be taken to be at work in producing birth order, but more importantly he takes Leslie's probabilistic claims to be empirically disconfirmed.

Bartha and Hitchcock accept that birth order can be modeled as a lottery; however, they use Bayes' theorem to calculate posterior probabilities for H1 and H2 [Bartha and Hitchcock, 1999]. They take there to be an asymmetry in the application of Bayes' theorem that leads to the favoring of doom sooner. However, they note that the lottery that Leslie appeals to over possible birth orders presupposes that I exist, which makes me special. They argue that in calculating the posterior probability for H1 and H2, we must first conditionalize on the assumption that I exist at all, and this favors doom later, which they claim cancels out the evidence for doom sooner by dint of my likely place in the birth order. There appear to be various ways of blocking the implausible claim that my existence now makes the demise of the human race more likely to occur sooner rather than later.

Although none of the problems here involve assumptions that are so intuitive that we face a particularly gripping paradox, they each raise questions about the appropriate application of probabilistic principles. They highlight the importance of choosing the right reference class in assessing probabilistic relevance, and both the shooting room and Simpson's paradox show the importance of looking to correlations that reflect causally relevant factors. More generally, Sober's analysis of the doomsday paradox highlights the idea that the correlations we employ must reflect the sort of process at work that bears upon the outcome of the event at issue. It is a general problem to formulate principles that determine the relevant probabilities. As the next puzzle indicates, there is considerable controversy over such principles.
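Both numbers in the shooting room paradox can be exhibited in a single simulation. In the sketch below (ours), round sizes grow tenfold, one schedule satisfying the problem's "increasing numbers" condition but not one fixed by the text: across completed runs at least 90% of all entrants die, while only about one round in 36 ends fatally.

import random

random.seed(5)

def shooting_room_run():
    # Round sizes grow tenfold (an assumed schedule), which guarantees
    # that the final round holds at least ~90% of everyone who entered.
    size, entered, rounds = 1, 0, 0
    while True:
        entered += size
        rounds += 1
        if random.randint(1, 6) == 6 and random.randint(1, 6) == 6:
            return entered, size, rounds   # the whole final round dies
        size *= 10

results = [shooting_room_run() for _ in range(10_000)]
total_entered = sum(e for e, _, _ in results)
total_died = sum(s for _, s, _ in results)
total_rounds = sum(r for _, _, r in results)

print("fraction of entrants killed:", total_died / total_entered)       # at least 0.9
print("fraction of rounds that are fatal:", len(results) / total_rounds)  # ~ 1/36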
4.3
Sleeping Beauty
The Sleeping Beauty problem centers around the following scenario: Beauty will participate in an experiment in which she will be put into a deep sleep on Sunday night. She will be awakened during the experiment either once or twice depending on the toss of a fair coin. In either case she will be awakened on Monday. She will be unaware upon awakening that it is Monday, but subsequently will be told that it is Monday and then given drugs to put her back to sleep. These drugs will erase her memory of having been awake. For definiteness, it is assumed that the coin is
tossed on Monday night after Beauty is put back to sleep. If the coin lands heads, Beauty will then sleep until Wednesday, at which point the experiment ends. If the coin lands tails, she will be awakened briefly again on Tuesday, unaware of the Monday awakening, and then put back into the deep sleep for the remainder of the experiment. Beauty knows all of this, and will be able to distinguish waking up within the experiment from other awakenings, so that when she awakens during the experiment she will know that it is either Monday or Tuesday. Before being put to sleep, it is presumably reasonable for Beauty to assign a probability of 1/2 to the proposition that the coin lands heads. Now, let us assume that the experiment has begun and Beauty has just awakened from a deep sleep. What should her probability be now for heads?

There are two plausible answers, each of which has been defended by multiple authors who offer various arguments for their answers. The disagreement centers over whether waking up within the experiment should be seen as evidentially relevant to heads. Elga, who introduced the problem into the philosophical literature, argues that it is, and that Beauty's probability for heads upon awakening should be 1/3 [Elga, 2000].9 Elga starts with the assumption that upon awakening, Beauty is in one of the following scenarios:

H1: Heads and it is Monday.
T1: Tails and it is Monday.
T2: Tails and it is Tuesday.

Let Ps be Beauty's probability function before being put to sleep on Sunday, Pa her probability function upon awakening, and Pm her probability function after being told that it is Monday. Elga argues that being in T1 is qualitatively like being in T2, and hence that Pa(T1) = Pa(T2). Elga claims that upon learning that it is Monday, Beauty should update her probabilities by conditionalizing on this information, which can be represented as H1 ∨ T1, and reasons that once Beauty is told that it is Monday, she should attach a probability of 1/2 to heads (after all, the coin has not yet been tossed). Thus, Pm(heads) = Pa(heads/(H1 ∨ T1)) = Pa(H1/(H1 ∨ T1)) = 1/2. From this it follows that Pa(H1) = Pa(T1), and hence that Pa(H1) = Pa(T1) = Pa(T2) = 1/3, and so Pa(heads) = 1/3.

Lewis [2001] argues against Elga that since Beauty knew in advance that she would be awakened, waking up within the experiment conveys no new information that is relevant to the probability of heads, and so it should remain at 1/2. However, he accepts Elga's framework and assumptions except for the claims that H1 ∨ T1 ∨ T2 is relevant to heads and that Pm(heads) = 1/2. From the premise that upon awakening Beauty learns nothing relevant to the probability of heads, Lewis argues that Pa(heads) = 1/2, and paradoxically that Pm(heads) = 2/3. The latter follows from the former claim that Pa(heads) = 1/2, together with the shared assumptions that Pa(heads/(H1 ∨ T1)) = Pa(H1/(H1 ∨ T1)) = Pa(H1)/(Pa(H1) + Pa(T1)),
9 There is a similar problem in which a driver (or passenger) travels a circular route forgetting whether he has passed through a given intersection [Aumann et al., 1997].
and that Pa(tails) = Pa(T1) + Pa(T2). Dorr [2002] points out that Lewis and Elga's common assumptions that

1. Pm(heads) = Pa(heads/(H1 ∨ T1))
2. Pa(heads/−(H1 ∨ T1)) = 0
3. 0 < Pa(H1 ∨ T1) < 1

together with

4. Pm(H1) = 1/2
5. Pa(H1) = 1/2

are probabilistically inconsistent. Elga avoids the inconsistency by rejecting (5), whereas Lewis does so by rejecting (4). But this still leaves Lewis with the unintuitive consequence that Pm(heads) = 2/3. Instead, one might give up (1), the assumption that one should update by conditionalization upon learning that it is Monday, as Joel Pust [2008] has suggested. This allows one to hold both that no information relevant to heads is gained when Beauty awakens, and also that learning that it is Monday is not relevant to heads either, which is highly plausible. To be sure, Beauty's belief shifts over the course of the experiment cannot be characterized exclusively in terms of conditionalization, since she changes her belief from zero to one in certain propositions, such as that it is Sunday. The inadequacy of conditionalization in dealing with beliefs that change due to loss of information and the acquisition of self-locating beliefs demands a more comprehensive treatment of belief change, and some recent attempts to grapple with this have consequences for the Sleeping Beauty problem. Titelbaum [2008] presents a framework that delivers the answer of 1/3, whereas Halpern [2005] presents a system in which Beauty's degree of belief upon awakening is 1/2. Such work shows that the conflicting answers to the Sleeping Beauty problem are not merely an artifact of Elga's and Lewis's representations, and may provide some basis for adjudicating the intuitions that have guided a variety of arguments concerning the problem. One of the key intuitions behind the answer of 1/3 concerns the fact that if the experiment were to be repeated, in the long run there would be twice as many tails awakenings as heads awakenings, and this suggests that the answer should be 1/3. However, this answer brings with it various difficulties as well.10

10 See [White, 2006].
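That frequency claim is easy to verify by simulation. The sketch below merely counts awakenings over repeated runs of the protocol; it is not itself an argument about what Beauty's credence should be upon awakening.

```python
import random

random.seed(2)
runs = heads_runs = heads_awakenings = total_awakenings = 0
for _ in range(100_000):
    heads = random.random() < 0.5
    runs += 1
    heads_runs += heads
    n_awake = 1 if heads else 2          # Monday only vs. Monday and Tuesday
    total_awakenings += n_awake
    heads_awakenings += n_awake if heads else 0

print("fraction of runs with heads:      ", heads_runs / runs)                    # ~ 1/2
print("fraction of awakenings with heads:", heads_awakenings / total_awakenings)  # ~ 1/3
```

The halfer and thirder answers thus track two different long-run frequencies, per run and per awakening, which is one way of locating the disagreement.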
5
PARADOXES OF PROBABILITY AND DECISION
The probability puzzles considered thus far involve applications of probability in various situations, but do not involve utility. Of course the Monty Hall problem
concerns a decision about whether one should switch doors, but this turns upon just the credences that one should have about where the prize is. Utilities, and the issue of how they combine with credences in making rational decisions, do figure prominently in some problems, including the two envelopes and St. Petersburg paradoxes.
5.1
Newcomb’s Problem
In this problem, an extraordinarily good predictor of behavior offers you a choice between selecting one opaque box and selecting both the opaque box and a transparent box, which you can see contains one thousand dollars. If he predicted that you would select only the opaque box, then he placed one million dollars in the box. If he predicted that you would select both boxes, then he left the opaque box empty. The situation is summed up in the following decision matrix:
                         $1,000,000 in opaque box    $0 in opaque box
Take both boxes          $1,001,000                  $1,000
Take opaque box only     $1,000,000                  $0
There is some controversy as to the rational choice, but most think that you should take both boxes; the amount of money in the opaque box is fixed, and regardless of what it contains, you will be $1,000 richer if you take both boxes. In addition to the question of the right choice, we can ask what rational choice theory prescribes. The rational act is usually taken to be the one that maximizes expected value. While the calculation is generally unproblematic, in this case different versions of decision theory offer different answers. According to Jeffrey's evidential decision theory, one should compute expected value in terms of the conditional probabilities of the outcomes given the acts [Jeffrey, 1983]. Let T be the proposition that both boxes are taken, O that just the opaque box is taken, and M that there is a million dollars in the opaque box. The expected values as computed by Jeffrey's theory are as follows:

EXP(T) = 1,001,000 × P(M/T) + 1,000 × P(−M/T)
EXP(O) = 1,000,000 × P(M/O) + 0 × P(−M/O)

With P(M/T) low and P(M/O) high, as the predictor's reliability warrants, EXP(O) > EXP(T). However, this presents a difficulty for Jeffrey's evidential decision theory, since it involves choosing a dominated act. Most decision theorists (including Jeffrey) accept that the right decision is to take both boxes, and accordingly some have suggested
various alternative ways of computing the expected value of an act, which prescribe taking both boxes in the Newcomb problem. Indeed, the problem has proven to be not so much a paradox as a spur that has driven the development of causal decision theory. The basic idea stems from observing that the probability of a state occurring may well be high conditional on some act without the act having causal relevance to whether the state occurs. In this case, the probability of ending up with nothing in the opaque box is very high conditional on choosing both boxes, although the choice is causally irrelevant to whether one million dollars is placed in the box. Causal decision theory solves the problem by replacing the conditional probabilities with causal conditional probabilities in calculating expected value; that is, the conditional probability of a state given an act is replaced by the probability that the act will result in the state in question.11 Considered as a problem of decision theory, Newcomb's problem does not seem paradoxical given the development of causal decision theory. However, we can regenerate something of a paradox by reflecting on the one-box solution. One can regard this solution as motivated by the idea that choosing one box goes along, at least in the long run, with being the sort of person whom the predictor will recognize as a one-boxer and for whom he will place one million in the opaque box. It seems at least somewhat paradoxical that one should choose an act (taking both boxes) that one regards as not being in accordance with the policy which, at least in the long run, can be expected to produce greater gains. As the two-boxers correctly point out, the fact that it is in one's interest to be the sort of person who would choose one box does not entail that this is the right decision once the amount of money in the opaque box has been fixed. However, there still appears to be some tension between the general prescription to take both boxes and being the sort of person who picks one box. Of course, those who advocate taking both boxes in the Newcomb problem need not be committed to adopting any sort of general strategy of always taking two boxes, where this would have a causal influence on future plays of the game. Indeed, it is an assumption of the two-box solution that the problem concerns a one-play game, or at any rate that taking both boxes will not have a causal effect on future plays. Even if there is a conflict with assuming that what one does in each play of a Newcomb game is causally irrelevant to subsequent plays, there is seemingly no problem in supposing that taking both boxes in a given play is causally independent of how much money is in the opaque box, and this suffices for the prescription that one should take both boxes.

11 Versions of causal decision theory and the relationships between them are discussed in [Joyce, 1999].
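The two ways of computing expected value are easy to contrast numerically. In the sketch below, the predictor's accuracy (0.99) and the unconditional chance m that the million is present are illustrative assumptions, not part of the original problem.

```python
# Evidential vs. causal expected value for Newcomb's problem.
acc = 0.99                     # assumed P(prediction correct); any value near 1 works

# Evidential theory: condition on the act itself.
exp_T = 1_001_000 * (1 - acc) + 1_000 * acc        # P(M/T) = 1 - acc
exp_O = 1_000_000 * acc + 0 * (1 - acc)            # P(M/O) = acc

# Causal theory: the box contents are fixed, so use the act-independent chance m.
m = 0.5                        # illustrative; the act cannot influence m
c_exp_T = 1_001_000 * m + 1_000 * (1 - m)
c_exp_O = 1_000_000 * m

print(exp_T, exp_O)            # 11000.0  990000.0 -> evidential theory favors one-boxing
print(c_exp_T, c_exp_O)        # 501000.0 500000.0 -> causal theory favors two-boxing
```

Whatever value m takes, the causal calculation favors two-boxing by exactly $1,000, reflecting the dominance argument in the text.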
5.2
The Cable Guy
Another apparent problem for decision theory is discussed by Hájek [2005]. The cable guy will install your new cable between 8 am and 4 pm. You are offered the chance to bet that he will come either during the morning interval from (8 to 12]
or during the afternoon from (12 to 4). The morning interval includes 12 noon, and so contains an extra moment, but we can assume that the probability that the cable guy comes right at noon is zero, and so we can take the two intervals to be of the same duration. It seems that you should be indifferent between betting on the morning or the afternoon. Assuming that the payoff is the same for each bet, your expected value for the two bets should be equal. However, if you bet on the morning interval, there is certain to be some time at which you will regret having chosen to bet on the morning, since the cable guy is to arrive at some point after 8 am. But then, despite the fact that it seems one should be indifferent between betting on the two intervals, one should bet on the afternoon, since betting on the morning interval violates the following plausible principle of rational decision:

Avoid Certain Frustration Principle: Given a choice between two options you should not choose an option for which you are certain that a rational future self will prefer that you had chosen the other, unless both options have this property. [Hájek, 2005]

As stated, the paradox turns on the acceptability of this principle. Although it seems reasonable, Hájek takes the cable guy paradox as showing that it is mistaken.12 While the principle may often apply, it does seem that it might be overridden in this case. Closer examination of the principle itself suggests another reason for thinking that it is false. Notice that when there is some time t, however fleeting, at which a future act A will certainly be rationally preferred to B, yet many times at which B will almost certainly be preferred to A, the principle requires choosing A, regardless of the likelihood of generally preferring B to A. Although it seems reasonable to reject the principle to avoid certain frustration, one could also reject the idea that it applies here by suggesting that the certain regret upon which it depends is simply not rational. The cable guy problem is not particularly paradoxical, because its assumptions turn out not to be so plausible after all, although there may still be some question about which of these should be rejected.

12 His response is seconded and further discussed in [Kierland et al., 2008].
5.3
The St. Petersburg Paradox
The St. Petersburg paradox arises from a game in which a fair coin is flipped until it comes up tails. The payoff is 2ⁿ, where n is the number of flips. Take a run of the game to be a sequence of n − 1 heads followed by tails. Since there are infinitely many possible runs of the game, one for each natural number n, each having positive probability, the expected value of playing the game is infinite. So, it appears rational to pay any finite sum to play the game. The paradox arises from the fact that a very low payoff is highly probable. It seems reasonable to be unwilling to pay more than a relatively small amount to play a game with a high probability of yielding a low payout.
One response to the paradox claims that unwillingness to pay more than a small sum to play can be accounted for by appealing to risk aversion.13 One trouble with this is that risk aversion is far from universal. If risk aversion is present, then at least to some extent it can be compensated for by offering higher payoffs. In particular, if a person is only willing to pay a very small sum to enter the game, then classical decision theory assumes that this amount can be increased by a corresponding adjustment to the payoffs [Martin, 2004]. Another response is simply to stipulate that there is an upper bound on the value of the consequences of any act. Jeffrey endorses this, noting: "Our rebuttal of the St. Petersburg paradox consists in the remark that anyone who offers to let the agent play the St. Petersburg game is a liar, for he is pretending to have an indefinitely large bank" [Jeffrey, 1983, p. 154]. A related puzzle involves the so-called "Pasadena game" [Nover and Hájek, 2004]. As in the St. Petersburg paradox, a fair coin is tossed until it lands heads, but instead the payoffs grow as 2ⁿ/n and alternate between positive and negative values. This game appears to present a new problem for decision theory. The trouble is that the game's expectation is a conditionally convergent series, and so can be reordered so as to sum to any real number, or to diverge to either positive or negative infinity. For discussion see [Nover and Hájek, 2004].

13 See [Weirich, 1984].
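Jeffrey's point about the finite bank can be made quantitative: if no payout can exceed the banker's fortune, the expected value of the game grows only logarithmically in the size of the bank. The following sketch (with illustrative bank sizes) computes this capped expectation.

```python
from math import log2

def capped_value(bank):
    """Expected payoff of the St. Petersburg game when no payout can exceed `bank`."""
    total, n = 0.0, 1
    while 2 ** n < bank:
        total += (0.5 ** n) * (2 ** n)    # each uncapped term contributes exactly 1
        n += 1
    total += (0.5 ** (n - 1)) * bank      # all longer runs pay the capped amount
    return total

for bank in (10**3, 10**6, 10**9, 10**12):
    print(f"bank {bank:>15,}: fair price = {capped_value(bank):.2f}")
```

Even a bank of a trillion justifies an entry fee of only about 41, roughly log2(bank) + 1, which accords well with the intuition that only a small sum should be paid to play.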
5.4
Two Envelopes
In some respects the most paradoxical of the decision theoretic puzzles is the two envelopes (or exchange) problem [Jackson et al., 1994].14 The basic set-up involves two envelopes, with one known to contain twice as much money as the other. You can select one envelope and keep its contents. Suppose it (the envelope on the right) contains $X. Assuming each envelope has an equal chance of containing the greater amount, the expected value Y of switching envelopes is

Y = (1/2)(X/2) + (1/2)(2X) = (5/4)X.

14 See also [Broome, 1995; McGrew et al., 1997; Rawling, 1997; Norton, 1998; Horgan, 2000; Fallis, 2008].
Thus, the basic decision theoretic principle that one should act so as to maximize expected value seems to require switching. Indeed, it suggests that one should be willing to pay a premium to do so. This is paradoxical, since the two envelopes are completely symmetric and so one has no reason to favor one envelope as having the greater amount over the other. Moreover, it seems that one could apply the same reasoning again, so as to justify switching back and forth indefinitely. Additionally, we can imagine a variation in which there are two players, where each is assigned an envelope at random. Each player's expected value for switching apparently exceeds that of sticking, and yet this is a zero-sum game, and so they cannot both have an advantage in doing so.
Suppose that after choosing you are permitted to look at the contents of your envelope, and that you find $100. The expected value of switching is $125, and so you should switch. There is nothing paradoxical about this. Of course, one might not want to switch due to the decreasing marginal value of money or risk aversion, but putting such considerations aside, switching is the appropriate choice. Here there is no argument for switching back to the first envelope, since sticking now has an expected value of $125, whereas switching back has an expected value of $100. It is now apparent that there is really no opportunity to turn players in the two-envelope game into money pumps. When a player looks into his envelope, he acquires a reason to switch. However, there is no expected gain in switching back, so he will be unwilling to pay to do so. But, if he does not look in his envelope, and it could contain any amount of money, he will not know what amount it would be reasonable to pay to switch. Still, there remains something of a paradox, because it seems that one should switch in advance of looking in the envelope, and having done so, to switch back. As various authors have pointed out, this requires the assumption that there is no upper bound on the amount of money in the envelopes [Jackson et al., 1994; Broome, 1995; Norton, 1998]. If one had a large sum in one's envelope, then it would be reasonable to think it very likely that the other envelope contains the smaller amount. Thus for particular values of X, the probability of the other envelope containing 2X will not be 1/2. There are clearly many prior distributions for the sum of money in the envelopes such that looking in the right envelope would result in something other than a probability of 1/2 that the other (left) envelope contains twice (half) what is in the right envelope. Given such a distribution, the paradox dissolves, and it would be reasonable to have just such a distribution in a typical situation in which one is presented with two envelopes and reliably told that one contains twice the other. However, as Broome points out [Broome, 1995], not all prior distributions are like this; i.e. there are some distributions, determining the probabilities that y = 2x and y = x/2, such that the expected value of y given x (switching) is greater than x. This means that we can regenerate the paradox by building in assumptions guaranteeing that the expected value of y given x (switching) is greater than x. There is then a strengthened two envelope paradox in which it is given that upon opening one of the envelopes you would think that the other is equally likely to contain half or twice as much money [Fallis, 2008]. The stipulation that it is equally likely that the other envelope contains half or twice as much money as the one selected brings with it certain complications. It requires that there is no maximum amount of money that can be placed in the envelopes, but also a uniform distribution over infinitely many states, which cannot be captured by a standard probability function. John Norton [1998] develops a modified version of the paradox that is more mathematically tractable, in which the probability distribution fails to be uniform over infinitely many states, but which satisfies the key assumption that there is a positive expectation of gain in switching for any definite amount that is placed in the envelopes. However, as
Norton observes, where the amount in the envelope is unknown, the infinite sum of the expectation fails to exist. Thus, for each particular amount, the expectation is finite and so one should switch, but there is no recommendation to switch overall, and thus there is a failure of dominance [Norton, 1998; Rawling, 1997]. The failure of the expectation to converge in the strengthened two envelope paradox shows that it is related to the St. Petersburg and Pasadena games [Broome, 1995; Norton, 1998].
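The effect of a bounded prior is easy to see numerically. In the simulation below the smaller amount is drawn uniformly from {1, . . . , 100}; this distribution is purely an illustrative assumption. Conditional on most observed amounts the expected gain from switching is positive, but for amounts that can only be the larger member of a pair switching is a sure loss, and the unconditional expected gain is zero.

```python
import random
from collections import defaultdict

random.seed(3)
gains = defaultdict(list)     # observed amount -> list of gains from switching

for _ in range(400_000):
    a = random.randint(1, 100)            # smaller amount: bounded, illustrative prior
    pair = (a, 2 * a)
    mine = random.choice((0, 1))
    x, other = pair[mine], pair[1 - mine]
    gains[x].append(other - x)

for x in (2, 50, 80, 150, 200):
    g = gains[x]
    print(f"observed {x:>3}: mean gain from switching = {sum(g)/len(g):+.1f}")

all_g = [g for lst in gains.values() for g in lst]
print(f"unconditional mean gain: {sum(all_g)/len(all_g):+.2f}")   # ~ 0
```

Observing 150 or 200 (which can only be the doubled amount under this prior) makes switching a certain loss, which is how the bounded prior breaks the "always switch" argument.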
5.5
Other Problems of Decision
There are many other paradoxes involving probability that arise in decision theory. Among the most widely discussed are the Prisoner's Dilemma and the Allais paradox. The Prisoner's Dilemma involves a two-person game in which the best strategy for each individual is such that when both players act upon it, a non-optimal outcome results. The Allais paradox involves preferences for options that many people apparently endorse, but that violate the independence axiom, which requires that one's preferences for acts depend only on those states in which their consequences differ. While it is called a paradox, it differs markedly from the others considered here. In particular, the conflict does not derive from intuitively plausible assumptions alone, but rather from a clash between these and some empirical findings about what some seemingly rational people prefer. As such it is not a traditional paradox, despite its name. In the Prisoner's Dilemma and Allais paradoxes there seems to be little controversy about the probabilistic assumptions involved. Instead, these paradoxes raise complex philosophical issues concerning the axioms of rational preference and the principle of utility maximization. Of course, each of the paradoxes in this section involves the concept of expected value, and many of them suggest that some qualification, or refinement, is needed in order to avoid paradox. Many agree, though, that the remedies involve adjustments, rather than radical changes, to decision theory. As noted, one response to the St. Petersburg paradox is to insist that there must be an upper bound on the value of any game. For any actual case that one might encounter, this restriction is unimportant. However, problems such as the Prisoner's Dilemma and the Allais paradox have led some to reject the fundamental principles of rational preference and rational choice. As such they are not merely small puzzles, but the sort of problems that, like Russell's paradox, which led to a rethinking of the foundations of set theory, have prompted the development of diverse theories of rationality.
ACKNOWLEDGEMENTS

Thanks to an anonymous referee for helpful comments on an earlier draft of this paper.
BIBLIOGRAPHY

[Aumann et al., 1997] R. J. Aumann, S. Hart, and M. Perry. The Forgetful Passenger. Games and Economic Behavior 20:117-120, 1997.
[Bartha and Hitchcock, 1999] P. Bartha and C. Hitchcock. No One Knows the Date or the Hour: An Unorthodox Application of Rev. Bayes's Theorem. Philosophy of Science 66 (3):S339-S353, 1999.
[Bickel et al., 1977] P. J. Bickel, E. A. Hammel, and J. W. O'Connell. Sex Bias in Graduate Admissions: Data from Berkeley. In William B. Fairley and Frederick Mosteller (eds.), Statistics and Public Policy, Reading: Addison-Wesley, 1977.
[Broome, 1995] J. Broome. The Two-envelope Paradox. Analysis 55 (1):6-11, 1995.
[Cartwright, 1983] N. Cartwright. How the Laws of Physics Lie. Oxford: Oxford University Press, 1983.
[Clark, 2002] M. Clark. Paradoxes from A to Z. New York: Routledge, 2002.
[Dorr, 2002] C. Dorr. Sleeping Beauty: in defense of Elga. Analysis 62 (4):292-296, 2002.
[Elga, 2000] A. Elga. Self-locating Belief and the Sleeping Beauty Problem. Analysis 60 (2):143-147, 2000.
[Fallis, 2008] D. Fallis. Resolving the Strengthened Two Envelope Paradox. Presented at the American Philosophical Association, Central Division Meeting, Chicago, 2008.
[Hájek, 2005] A. Hájek. The Cable Guy Paradox. Analysis 65:112-119, 2005.
[Hájek, 2007] A. Hájek. Interpretations of Probability (Winter 2007). http://plato.stanford.edu/archives/win2007/entries/probability-interpret/
[Halpern, 2005] J. Y. Halpern. Sleeping Beauty Reconsidered: Conditioning and Reflection in Asynchronous Systems. In T. S. Gendler and J. Hawthorne (eds.), Oxford Studies in Epistemology, Oxford University Press, 111-142, 2005.
[Horgan, 2000] T. Horgan. The Two-Envelope Paradox, Nonstandard Expected Utility, and the Intensionality of Probability. Nous 34:578-603, 2000.
[Hunter and Madachy, 1975] H. A. H. Hunter and J. H. Madachy. Mathematical Diversions. Toronto: Dover, 1975.
[Jackson et al., 1994] F. Jackson, P. Menzies, and G. Oppy. The Two Envelope Paradox. Analysis 54:43-45, 1994.
[Jaynes, 1973] E. T. Jaynes. The Well-Posed Problem. Foundations of Physics 3:477-492, 1973.
[Jeffrey, 1983] R. C. Jeffrey. The Logic of Decision. Second ed. Chicago: The University of Chicago Press, 1983.
[Joyce, 1999] J. M. Joyce. The Foundations of Causal Decision Theory. Cambridge Studies in Probability, Induction and Decision Theory. Cambridge: Cambridge University Press, 1999.
[Kierland et al., 2008] B. Kierland, B. Monton, and S. Ruhmkorff. Avoiding Certain Frustration, Reflection, and the Cable Guy Paradox. Philosophical Studies 138:317-333, 2008.
[Kyburg, 1970] H. E. Kyburg, Jr. Probability and Inductive Logic. Toronto: The Macmillan Company, 1970.
[Leslie, 1990] J. Leslie. Is the End of the World Nigh? The Philosophical Quarterly 40 (158):65-72, 1990.
[Leslie, 1996] J. Leslie. The End of the World. London: Routledge, 1996.
[Lewis, 2001] D. Lewis. Sleeping Beauty: reply to Elga. Analysis 61 (3):171-176, 2001.
[Malinas and Bigelow, 2004] G. Malinas and J. Bigelow. Simpson's Paradox (Spring 2004). http://plato.stanford.edu/archives/spr2004/entries/paradox-simpson/
[Martin, 2004] R. Martin. The St. Petersburg Paradox (2004), E. Zalta, ed. http://plato.stanford.edu/archives/fall2004/entries/paradox-stpetersburg/
[McGrew et al., 1997] T. J. McGrew, D. Shier, and H. S. Silverstein. The Two-Envelope Paradox Resolved. Analysis 57 (1):28-33, 1997.
[Norton, 1998] J. D. Norton. When the Sum of Our Expectations Fails Us: The Exchange Paradox. Pacific Philosophical Quarterly 79:34-58, 1998.
[Nover and Hájek, 2004] H. Nover and A. Hájek. Vexing Expectations. Mind 113:237-249, 2004.
[Pust, 2008] J. Pust. Sleeping Beauty, Conditionalization, and Knowledge De Praesenti. Presented at the American Philosophical Association, Pacific Division Meeting, Pasadena, CA, 2008.
[Rawling, 1997] P. Rawling. Perspectives on a Pair of Envelopes. Theory and Decision 43:253-277, 1997.
[Sainsbury, 1988] R. M. Sainsbury. Paradoxes. Cambridge: Cambridge University Press, 1988.
[Sober, 2003] E. Sober. An Empirical Critique of Two Versions of the Doomsday Argument: Gott's Line and Leslie's Wedge. Synthese 135:415-430, 2003.
[Titelbaum, 2008] M. G. Titelbaum. The Relevance of Self-Locating Beliefs. Philosophical Review, 2008.
[Tversky and Kahneman, 1974] A. Tversky and D. Kahneman. Judgment under Uncertainty: Heuristics and Biases. Science 185:1124-1131, 1974.
[van Fraassen, 1989] B. C. van Fraassen. Laws and Symmetry. New York: Oxford University Press, 1989.
[Weatherford, 1982] R. Weatherford. Philosophical Foundations of Probability Theory. London: Routledge and Kegan Paul, 1982.
[Weirich, 1984] P. Weirich. The St. Petersburg Gamble and Risk. Theory and Decision 17:193-202, 1984.
[White, 2006] R. White. The Generalized Sleeping Beauty Problem: A Challenge for Thirders. Analysis 66:114-119, 2006.
STATISTICAL PARADOXES: TAKE IT TO THE LIMIT
C. Andy Tsao
1
INTRODUCTION
There are two kinds of paradoxes. The first kind arises from confusion: people think there is a paradox, but actually there is none once the confusion is clarified. The second kind arises from a conflict among principles we find compelling. There is a fundamental tension in our thinking brought out by these paradoxes. We cannot simply resolve them by deepening our understanding; rather, we have to recognize an uncomfortable truth. Naturally, the paradoxes of the second kind are more interesting. Nonetheless, the taxonomy of paradoxes is usually as difficult as the paradoxes themselves. In this article, we will discuss two statistical paradoxes: Lindley's paradox and the Fieller-Creasy problem — see [Lindley, 1957] and [Fieller, 1954] respectively. Both arose in the 1950s, yet so far as we know they have not been fully resolved. They are paradoxes of limits. It is worth noting that mathematics may not be as rigorous as we hope it to be; it can be quite tricky, especially when we push it to the limit.

Our statistical paradoxes can be illustrated by the following example. Let θC (cm) be the length of Cleopatra's nose. We are interested in knowing:

• What is θC?
• What is the possible range of θC?

These can be answered by the statistical theory of point estimation and confidence intervals. Moreover, we might also like to know:

1. Is θC = 15 (cm)?
2. Is θC equal to θE, the nose length of Elizabeth Taylor? Formally:
   • Is D = θE − θC = 0?
   • Is ρ = θE/θC = 1?
These are typical problems of hypothesis testing. The first is a question about the (population) mean of one population, while the second concerns the comparison of means from two populations. Note that there are at least two ways of representing the equality: a difference of zero or a ratio of 1. In classical (or frequentist) statistics, θC is considered a parameter, a fixed yet unknown quantity. We do not know what the true value of θC is, and hence resort to observation. Suppose we can measure Cleopatra's nose n times in an independent manner. The commonly assumed model is that X1, X2, · · · , Xn are independently and identically distributed as a normal random variable with mean θC and variance σ² > 0, abbreviated as X1, · · · , Xn ∼iid N(θC, σ²). Similarly, we assume Y1, · · · , Ym ∼iid N(θE, σE²) for the observations of the nose length of Ms. Taylor.

Lindley's paradox occurs in hypothesis testing problems where the null hypothesis is just one value (a point null hypothesis). For example, for testing θ = 15 versus θ ≠ 15, Lindley's paradox suggests that the p-value is more liberal than its Bayesian contenders. In other words, based on the same observation, the classical statistical evidence that the nose length is 15 cm differs from the Bayesian evidence (posterior probabilities with respect to some priors). Moreover, it contradicts the common belief that classical statistical procedures are objective and more conservative than Bayesian procedures. In the context of hypothesis testing, a conservative test is one that is less ready to reject the null hypothesis than its contenders.

The Fieller-Creasy problem highlights problems about formulation, and how a sound classical/frequentist statistical procedure can have problematic properties. First, the formulation of the problem of noses of equal length affects the choice of suitable statistical procedures: equality represented by D = 0 or by ρ = 1 will lead to different tests and, consequently, different conclusions. In addition, the confidence intervals for the ratio are quite problematic. [Gleser and Hwang, 1987] shows that a 95% confidence interval for ρ, the ratio of nose lengths, has a positive probability of being the entire parameter space (−∞, ∞) or a set excluding a neighborhood of 0.

Lindley's paradox and the Fieller-Creasy problem have been widely discussed ever since their discovery. Our paper is not intended to be an exhaustive review. Instead, we hope to highlight some interesting perspectives on the paradoxes and, hopefully, remind philosophers, statisticians and users of statistics of these uncomfortable facets of statistics.
2
FREQUENTIST VS. BAYESIAN
Lindley's paradox and the Fieller-Creasy problem are important illustrations of the frequentist-Bayesian discrepancy. The discrepancy starts with the different interpretations of probability. Explicitly, frequentists and Bayesians treat the uncertainty about the parameter and the randomness of the sampling data in different ways. These differences in turn affect the choice of statistical procedures and the guarantees of these procedures.
For ease of exposition, we will assume normality. Let X1, · · · , Xn ∼iid N(θ, σ²), a random sample from a normal probability density function (pdf) f(x|θ) with unknown parameter θ ∈ Θ = (−∞, ∞) and known σ² > 0. Typical statistical problem formulations are point estimation (estimate what θ is), confidence intervals (an interval estimate of θ) and hypothesis testing (which hypothesis about θ is more likely). The frequentist considers the parameter θ an unknown yet fixed quantity and X1, · · · , Xn as one random sample (of size n) from one of infinitely many possible experiments. The commonly recommended estimator of θ is

X̄ = (X1 + · · · + Xn)/n.    (1)

A (1 − α) confidence interval of θ is

C(X) = [X̄ − zα/2 σn, X̄ + zα/2 σn]    (2)

where zα is such that P(Z < zα) = 1 − α, σn² = σ²/n and Z is a N(0, 1) random variable. For testing

H0 : θ = θ0  vs.  H1 : θ ≠ θ0    (3)

the recommended α level test rejects H0 when

|X̄ − θ0|/σn > zα/2.    (4)

In addition, for testing (3), the p-value is defined as

p(x) = P(|X̄ − θ0|/σn > |x̄ − θ0|/σn) = P(|Z + ∆| > |x̄ − θ0|/σn)    (5)

where ∆ = (θ − θ0)/σn. These frequentist statistical procedures are "recommended" because they have good frequentist properties. To assess whether (1) estimates θ well, (2) covers θ or (4) makes an accurate decision, one needs to know the true value of θ. However, because the parameter is unknown, there is no way to verify how these procedures (for point estimation, etc.) perform with respect to the currently observed data x1, · · · , xn. The notational difference is worth noting: X's refer to the random observations while x's refer to the observed values of the X's in the given experiment. To discuss the guarantees or properties, frequentists resort to long-run average performances, for example, unbiasedness in point estimation, the confidence coefficient (minimum coverage probability) and the significance level (maximum probability of Type I error). All of them interpret probability as long-run frequency via the Law of Large Numbers (LLN):
E(X) = lim_{K→∞} (1/K) ∑_{k=1}^{K} X^(k)    (6)

where X^(k) is an iid sample from f(x|θ) in the k-th experiment. The probability of an event A can be viewed as the expectation of 1[X∈A], the indicator function of A; hence

P(A) = E 1[X∈A] = lim_{K→∞} (1/K) ∑_{k=1}^{K} 1[X^(k)∈A].    (7)

Similarly

E(X̄) = lim_{K→∞} (1/K) ∑_{k=1}^{K} X̄^(k)    (8)

where X̄^(k) = (X1^(k) + · · · + Xn^(k))/n, computed from an iid sample of size n from the k-th experiment. It guarantees that the estimator X̄ will hit the true θ when averaging over applications in the long run. The confidence interval (2) has (1 − α) confidence coefficient, that is,

min_{θ∈Θ} Pθ(θ ∈ C(X)) = 1 − α.    (9)

This indicates that C(X) will cover the true θ about (1 − α)100% of the time in long-run applications in repeated experiments. The α level test (4) guarantees that

max_{θ∈Θ0} Pθ(Reject H0) = α.    (10)

Define Θ as the parameter space (the collection of all possible values of θ) and Θ0 as the null space (the collection of all θ such that the null hypothesis is true). Recall that all of the probabilities and expectations are interpreted as long-run frequencies in the frequentist inference framework. Moreover, we would like to point out that the minimum and maximum (or, more rigorously, the infimum and supremum) in the definitions of (9) and (10) reflect the guarantee under the worst scenario. In this sense, classical/frequentist statistical procedures are built to be conservative.

Bayesians formulate the uncertainty about the parameter θ through a prior π(θ). Under our normal setting, one of the mathematically convenient priors is the conjugate prior π(θ) ∼ N(µ, τ²), where µ is a real number and τ² > 0. Upon observing x = (x1, · · · , xn)′, the posterior pdf of θ can be derived by Bayes' Theorem:

π(θ|x) = π(θ)f(x|θ) / ∫_Θ π(θ)f(x|θ) dθ ∼ N(µx, ρ⁻¹)

where

µx = (1/ρ)(µ/τ² + x̄/σn²) = (σn² µ + τ² x̄)/(σn² + τ²)    (11)

and

ρ = 1/τ² + 1/σn² = (σn² + τ²)/(σn² τ²)    (12)
where Θ = (−∞, ∞) is the parameter space. For most Bayesians, the posterior distribution contains all the information needed for statistical inference. Under squared error loss, the Bayes estimate of θ with respect to π(θ) is the posterior mean Eπ(θ|x) θ = µx. The expression for µx in (11) can be viewed as a weighted average of the sample mean x̄ and the prior mean µ. In addition, for very large sample size n, µx is essentially x̄. On the other hand, if τ² goes to infinity, µx becomes x̄; if τ² goes to zero, µx becomes µ. In terms of estimating θ under the current normal-normal setting, the Bayes point estimate is µx and the frequentist point estimate is x̄. This is a perfect illustration of the widely held intuition/belief: as the (prior) information diffuses or a "non-informative" prior is used, the Bayes inference coincides with the frequentist inference; and Bayesian inference is similar to the frequentist one when the sample size is very large.

The Bayes interval estimate analogous to the frequentist confidence interval for θ is the Bayes credible set. We say Cπ(x) is a (1 − α) credible set for θ if

Pπ(θ|x)(θ ∈ Cπ(x)) = 1 − α.    (13)

To some Bayesians, there is little point in considering hypothesis testing problems. They argue that the posterior distribution provides a better and more complete picture regarding θ. Nonetheless, others develop tests in the Bayesian framework. Roughly speaking, Bayesian tests are based on the posterior probabilities of Θ0 and Θ1,

Pπ(θ|x)(θ ∈ Θ0) and Pπ(θ|x)(θ ∈ Θ1),    (14)

through their ratio (the posterior odds) or the Bayes factor. Typically, if Pπ(θ|x)(θ ∈ Θ1) is (much) larger than Pπ(θ|x)(θ ∈ Θ0), Bayesians will reject H0. In the simple versus simple hypothesis testing problem, Bayesian testing procedures are again similar to frequentist testing procedures: Bayesian tests depend on a weighted likelihood ratio and the frequentist ones depend on the likelihood ratio.
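A small numerical sketch of (11) and (12) makes the weighted-average reading concrete. The true mean, prior mean and prior variances below are illustrative assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true, sigma, n = 15.0, 2.0, 25        # illustrative values
x = rng.normal(theta_true, sigma, n)
xbar, s2n = x.mean(), sigma**2 / n          # x-bar and sigma_n^2

mu = 10.0                                   # illustrative prior mean
for tau2 in (0.1, 1.0, 100.0, 1e6):         # prior variance: dogmatic to diffuse
    rho = 1/tau2 + 1/s2n                    # posterior precision, eq. (12)
    mu_x = (mu/tau2 + xbar/s2n) / rho       # posterior mean, eq. (11)
    print(f"tau^2 = {tau2:>9}: Bayes estimate {mu_x:.3f} vs. frequentist xbar {xbar:.3f}")
```

As τ² grows the Bayes estimate converges to x̄, and as τ² shrinks it collapses to the prior mean µ, exactly as described above.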
3
LINDLEY'S PARADOX
Hypothesis testing is a well-established problem formulation of statistics. It is widely used in many applied fields, and p-values are implemented in almost all statistical software as standard outputs. Casual users of statistics might think of hypothesis tests as synonymous with p-values, and it seems quite natural for them to have this impression. Standard statistical textbooks claim that the p-value is the data-dependent evidence against the null hypothesis. Statistical packages report p-values rather than specifying the significance level. Academic journal papers in most fields lavishly quote p-values as statistical support for their scientific discoveries. It is common practice to compare p-values with 0.05 to declare scientific results "significant".

Lindley's paradox can be viewed as a paradox about p-values. It casts a shadow on the practice of hypothesis testing. Namely,

• The p-value is NOT conservative statistical evidence (against the null hypothesis) compared with its Bayes contenders (with respect to some reasonable family of priors).
• Frequentist and Bayesian statistical measures do not necessarily coincide with each other, even for diffuse priors.

Specifically, let X1, · · · , Xn ∼iid N(θ, σ²) with unknown θ ∈ Θ = R and known σ² > 0. Consider the hypothesis testing problem:

H0 : θ = 0  vs.  H1 : θ ≠ 0.    (15)

Note that so far as two-sided hypothesis testing is concerned, without loss of generality, any point null hypothesis θ = θ0 can be translated to θ = 0. It is well recognized that this formulation might be unrealistic, but it nonetheless provides great simplification. In addition, the assignment of prior probability to the point 0 can be problematic for Bayesians. The more appropriate imprecise hypothesis testing problem has been proposed:

H0^ε : |θ| ≤ ε  vs.  H1^ε : |θ| > ε    (16)

for some ε > 0. One rationale is that values |θ| < ε are practically zero in applications for suitable ε. Some Bayesian approaches have been taken for (16), for example, [Verdinelli and Wasserman, 1996], [Delampady, 1989] and [Gómez-Villegas and Sánchez-Manzano, 1992]. Nonetheless, (15) has been suggested as a reasonable approximation to (16) for practical choices of ε. See, for example, [Berger and Delampady, 1987] for arguments and elaboration on this point. For testing (15), the recommended and practiced test is the uniformly most powerful unbiased α level test:

Reject H0 if |X̄|/σn > zα/2    (17)

where Z ∼ N(0, 1), zα/2 is the upper α/2 cutoff point of the standard normal such that P(Z > zα/2) = α/2, and σn² = σ²/n. The corresponding p-value is

p(x̄) = P0(|X̄| > |x̄|) = 2P(Z > |x̄|/σn)    (18)

where X̄ = (1/n) ∑_{i=1}^{n} Xi and x̄ = (1/n) ∑_{i=1}^{n} xi is its observed (realized) sample version. Lindley's paradox states that for large n, there exists moderate x̄ such that
• the p-value is less than a small α, say 0.05, and
• the posterior probability of θ = 0 is greater than 1 − α,

with a suitable choice of prior that assigns π0 = Pπ(θ = 0) and distributes the rest of the prior probability over the alternative space through a diffuse density. For example, the diffuse part may be chosen as a normal pdf with mean µ and large variance τ²:

π(θ) = π0 if θ = 0;  π(θ) = (1 − π0)(2πτ²)^(−1/2) exp[−(θ − µ)²/(2τ²)] otherwise.    (19)

A natural Bayes estimate of the indicator of θ = 0 is

Pπ(θ|x̄)(θ = 0) = Eπ(θ|x̄) 1[θ=0]    (20)

where 1[θ=0] is the indicator function, which equals 1 if θ = 0 and 0 otherwise. Note that (20) is the posterior expectation of 1[θ=0] and the Bayes estimate with respect to squared error loss. For large n, a moderate |x̄| will render Pπ(θ|x̄)(θ = 0) large (greater than 1 − α) and the p-value small (less than α). This is the paradox. It epitomizes the conflict between Bayesian and frequentist evidence in assessing whether θ = 0. This paradox has deep implications and causes concern about using hypothesis testing. See [Lindley, 1957], [Schafer, 1982] and references therein. Also [Berger and Delampady, 1987], [Berger and Sellke, 1987], [Casella and Berger, 1987], and more recently, [Tsao, 2006a] and [Tsao, 2006b] are studies motivated by this perplexity.

Is Lindley's paradox a paradox of confusion or a paradox of conflict? Here we focus on some assumptions that underpin Lindley's paradox:

1. The point null hypothesis is a reasonable approximation to the more realistic interval null hypothesis.
2. The singularity of the indicator function 1[θ=0] does not create artificial irregularity which in turn yields the paradox.
3. It is justifiable to compare Eπ(θ|x̄) 1[θ=0] with the p-value as Bayesian and frequentist evidence, respectively, for assessing the possibility that θ = 0.
4. The assignment of π0 = Pπ(θ = 0) does not seriously affect the results.

The mathematical reduction through a limit at 0 might create some irregularities. Note that these aforementioned assumptions are not mathematical conditions under which Lindley's paradox holds; rather, they underlie the arguments and proofs. For example, while the point null formulation (15) simplifies the (frequentist) computation of the distribution of the test statistic under the null space, it makes the Bayesian computation much harder. Inevitably, it induces the following questions:
What is a reasonable π0 to assign? How robust are the results to this assignment? Estimating an indicator function that takes the value 0 everywhere except at a single point is another difficult task. Will the paradox still hold if we replace it by a smoother version of the indicator function? Recently, [Tsao, 2006a] used

β(θ) = 2Φ(−|θ| / (σ/√m)),    (21)

with m a positive integer, as a smooth version of the indicator function 1[θ=0], and noted that the p-value is a maximum likelihood estimate of (21). Under this smooth null approach, it is shown that although the maximum likelihood estimate is more extreme than its Bayes contenders, the discrepancy is much less serious than that in Lindley's paradox. This suggests that although the maximum likelihood estimate tends to be more extreme than Bayes estimates with respect to some "reasonable" priors, the marked difference between them might be due to the singularity of the problem. The singularity, θ = 0, is artificially chosen for mathematical convenience. There are certainly other formulations for near-zero θ values from an applications perspective. In light of this observation, we think the p-value and a maximum likelihood estimate do behave differently, but the discrepancy noted in Lindley's paradox might be mild if we replace the limiting singularity by some smooth version of the formulation. For more discussion of these assumptions, please refer to [Tsao, 2006a].

Lindley's paradox is often quoted as an example of the irreconcilability of classical/frequentist and Bayesian statistics. However, the p-value is itself controversial as statistical evidence, and its validity is under active research. A line of research, for example [Berger and Delampady, 1987], [Berger and Sellke, 1987], [Casella and Berger, 1987] and [Hwang et al., 1992], studies the validity of p-values under a robust Bayesian and a formal decision theoretic framework. Unfortunately, so far as we know, there seems to be no consensus. Therefore, we think Lindley's paradox is better understood as a warning about the p-value, the common practice, as evidence against the null hypothesis, since the p-value tends to be more liberal (i.e. smaller) than Bayes estimates when the data are moderately significant.

Before closing this section, we would like to point out that a confidence interval can be a very informative supplement to the p-value in hypothesis testing. It helps us to see how statistical significance corresponds to practical significance in the given context. Suppose that the length of Cleopatra's nose is measured as 15.0001 cm based on the average of many independent measurements. Can we say the true nose length is θC = 15? Practically, yes; statistically, maybe not. The difference of 0.0001 may be of no practical importance but can be statistically significant when the sample size is large or the variance is small. The p-value captures the statistical significance, but it should be carefully examined in the context of the problem. A remedy is to look at the (two-sided) confidence interval of θ at the same time. By the duality of tests and confidence sets, see for example Theorem 9.2.1 of [Casella and Berger, 1990], the confidence interval can also be
used for hypothesis testing. Specifically, in our context,

Reject H0 if |X̄|/σn > zα/2  if and only if  0 does not belong to C(X),    (22)

where C(X) = X̄ ± zα/2 σn is a (1 − α) confidence interval for θ. The test statistic (X̄ − 0)/σn is a standardized difference of (X̄ − 0). It measures the difference on the standard normal scale, in the sense that the test statistic is distributed as a standard normal random variable. In our problem, this scale is σn. If the scale is too small for practical consideration, this will be immediately noticed through the confidence interval. For example, a 95% confidence interval of [15.00005, 15.00015] will lead to a very small p-value, yet the difference is of no practical importance: the nose length is essentially 15 cm.
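The orders of magnitude in Lindley's paradox can be checked directly from (18) and the marginal likelihoods implied by the prior (19). The script below is an illustrative check; the values π0 = 1/2, µ = 0, τ² = 1 and σ = 1 are assumptions chosen for the example. Holding the standardized statistic fixed at z = 2, the p-value stays near 0.046 while the posterior probability of θ = 0 climbs toward 1 as n grows.

```python
from math import sqrt, exp, pi, erf

def phi(z):                       # standard normal pdf
    return exp(-z * z / 2) / sqrt(2 * pi)

def Phi(z):                       # standard normal cdf
    return 0.5 * (1 + erf(z / sqrt(2)))

sigma, tau2, pi0, mu = 1.0, 1.0, 0.5, 0.0    # illustrative prior, as in eq. (19)
z = 2.0                                      # keep |xbar|/sigma_n fixed at 2
for n in (10, 100, 10_000, 1_000_000):
    sn = sigma / sqrt(n)
    xbar = z * sn
    p_value = 2 * (1 - Phi(z))               # eq. (18): ~0.0455 for every n
    f0 = phi(xbar / sn) / sn                 # marginal density of xbar under theta = 0
    s1 = sqrt(tau2 + sn * sn)
    f1 = phi((xbar - mu) / s1) / s1          # marginal density under the diffuse prior
    post0 = pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)
    print(f"n = {n:>9}: p-value = {p_value:.4f}, P(theta=0 | xbar) = {post0:.4f}")
```

For n = 10 the posterior probability of the null is about 0.35, but by n = 10⁶ it exceeds 0.99 while the p-value remains "significant" at 0.046: the paradox in numbers.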
4
FIELLER-CREASY PROBLEM
When we say Cleopatra and Elizabeth Taylor have the same nose length, we mean D = θE − θC = 0 or ρ = θE/θC = 1. Mathematically, these are equivalent statements. We tend to think either problem formulation is fine and that the statistical procedures will provide consistent answers. Unfortunately, it is not so. The difference formulation is widely applied. The statistical procedures (point estimation, confidence intervals and hypothesis tests) are similar to those for one-population problems and are well studied in statistical theory. On the other hand, the ratio formulation generates important differences in the corresponding statistical procedures, such as the confidence intervals. It is noted that the 95% Fieller's confidence interval, the popular confidence interval for the ratio ρ, has a positive probability of having infinite length. In the simplest setting, let X ∼ N(θ1, σ²) and Y ∼ N(θ2, σ²); the Fieller confidence set for ρ = θ1/θ2, where θ2 ≠ 0, is

CF(X) = {ρ : |X − ρY| / √(1 + ρ²) < zα/2}    (23)

where X = (X, Y)′. The ratio parameters are of importance in the fields of biology and bioassay. See, for example, [Finney, 1978], [Govindarajulu, 1988] and [Chow and Liu, 1992]. Proposed by [Fieller, 1954], the Fieller confidence set, denoted by CF(X), is a popular confidence set for ratios. Despite its popularity, it is one of the most well-known embarrassments to frequentist confidence theory; namely, with positive probability the set CF(X), as a 1 − α confidence set, can be the whole parameter space. Nonetheless, it is not a trivially bad set estimator: its justification is well founded, resting on pivotal quantity, likelihood ratio test and profile likelihood arguments. Furthermore, the possibility of infinite length of CF(X) cannot be avoided. In fact, it was proved in [Gleser and Hwang, 1987] that any confidence interval
for some errors-in-variables models, if it has almost surely finite length, must have zero confidence coefficient, the infimum of the coverage probability over the parameter space. These models include the normal model we assumed, with the ratio as the parameter. A remedy is to provide a data-dependent confidence report via the estimated confidence approach. This is, however, beyond the scope of the current paper; the interested reader is referred to [Kiefer, 1977] and [Berger, 1988]. [Fu, 1995] suggests that the problem with the ratio is due to the singularity induced by taking the limit of the denominator parameter to 0: the ratio shoots off to positive or negative infinity. This is yet another example in which, when mathematics is taken to the limit, some irregularity emerges. However, there is something more fundamental: a statistical procedure may not have good pre-data (frequentist) and post-data (Bayesian) performances at the same time. Here we refer to pre-data performance as the long-run frequentist properties such as the confidence coefficient (infimum of the coverage probability), accuracy and expected length; post-data performance refers to properties such as the posterior coverage probability and the (realized) length, where the data are considered given. We think the paradox of the Fieller-Creasy problem arises from confusion and conflict about the Bayes credible set and the frequentist confidence interval. While confidence intervals are taught as standard set procedures in statistics textbooks, users of statistics often pay less attention to the interpretation of confidence intervals. For example, the confidence coefficient 95% of a 95% confidence interval refers to the minimum coverage probability of the confidence interval as a random procedure, and the probability is interpreted through long-run frequency. Moreover, these realized confidence intervals are sometimes mistaken for 95% Bayes credible sets. Since many classical confidence intervals have constant coverage probability over the parameter space, they seldom correspond to Bayes credible sets with respect to proper priors. In light of this observation, we think it is natural, or at least to be anticipated, that the confidence interval for the ratio has strange post-data behaviours. The interested reader is referred to, for example, [Tsao, 1998] for the poor properties of conditional implementations of Fieller's confidence sets.
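Equation (23) is a quadratic inequality in ρ, so the shape of CF(X) can be computed exactly. The sketch below, with σ taken to be 1 and sample values chosen purely for illustration, exhibits all three regimes noted above: a bounded interval, the complement of an interval (a set excluding a neighborhood of 0), and the whole real line.

```python
from math import sqrt

Z = 1.96   # z_{alpha/2} for a 95% set; sigma is taken to be 1 for illustration

def fieller_set(x, y):
    """Solve |x - rho*y| < Z*sqrt(1 + rho^2), a quadratic inequality in rho."""
    a = y * y - Z * Z                        # leading coefficient of the quadratic
    disc = 4 * Z * Z * (x * x + y * y - Z * Z)
    if a > 0:                                # Y is significantly away from 0
        lo = (2 * x * y - sqrt(disc)) / (2 * a)
        hi = (2 * x * y + sqrt(disc)) / (2 * a)
        return f"the interval ({lo:.2f}, {hi:.2f})"
    if disc < 0:                             # inequality holds for every rho
        return "the whole real line (-inf, inf)"
    r1 = (2 * x * y + sqrt(disc)) / (2 * a)  # a < 0: region lies outside the roots
    r2 = (2 * x * y - sqrt(disc)) / (2 * a)
    return f"(-inf, {r1:.2f}) U ({r2:.2f}, inf)"

for x, y in ((10.0, 5.0), (2.0, 0.5), (1.0, 0.5)):
    print(f"X = {x}, Y = {y}: {fieller_set(x, y)}")
```

The second case excludes a neighborhood of 0 and the third returns the entire parameter space, which is exactly the pathology established by [Gleser and Hwang, 1987].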
5
CONCLUDING REMARK
Lindley's paradox and the Fieller-Creasy problem are often quoted as contradictions or inconsistencies between frequentist and Bayesian inference. The frequentist approach focuses on the pre-data aspects of statistical inference, while the Bayesian approach concentrates on post-data perspectives. We acknowledge the difference between these two paradigms of statistics. However, the discrepancy is not necessarily marked in practice. For example, [Mukhopadhyay and DasGupta, 1997] show that it is possible to construct (1 − α) HPD Bayes credible sets with minimum (frequentist) coverage probability uniformly close to (1 − α) as well. The drastic discrepancy, in our opinion, is induced by the singularity or irregularity in the formulation or the setting of the approach. In Lindley's
paradox, it is the estimation of the indicator function of a point null; in the Fieller-Creasy problem, the ratio parameter when the denominator approaches 0. Recall the "belief/intuition" we mentioned in Section 2: when a diffuse prior or a "non-informative" prior is used, the Bayes inference coincides with the frequentist inference, and Bayesian inference is similar to frequentist inference when the sample size is very large. Note that both a diffuse prior and a large sample correspond to cases in which the prior variance or the sample size goes to infinity. They are the limiting situations. We think there are fundamental differences between the frequentist and Bayesian approaches. They do not necessarily coincide or reconcile with each other. Within one domain, frequentist or Bayesian alike, the problem of limits remains, but the interpretations and guarantees are clear. However, if we pass from one domain into the other, every problem needs to be approached with caution. In mathematics, the existence of a solution of an equation often depends on the space/set in which we are looking for the solution. For example, x² + 1 = 0 has no real root, but it has roots in the complex plane. As in the fable of the blind men and the elephant, maybe we are not wise enough to see the whole picture, and the paradox remains unresolved. Yet if we can humbly assume there are some animals unknown to us, it is to be hoped that sometime in the future our knowledge domain will be broadened enough to lift these contradictions. Are these statistical paradoxes ones of confusion or of conflict? Well, we will leave the taxonomy of these paradoxes to our wise readers.

ACKNOWLEDGMENTS

The author would like to thank two anonymous referees for their careful reading and helpful comments. The research is supported by NSC 97-2118-M-259-003MY2, Taiwan.

BIBLIOGRAPHY

[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Second Edition, Springer-Verlag, New York, 1985.
[Berger, 1988] J. O. Berger. An alternative: The estimated confidence approach. In Statistical Decision Theory and Related Topics IV, Vol. 1 (S. S. Gupta and J. O. Berger, eds.), 85-90. Springer-Verlag, New York, 1988.
[Berger and Delampady, 1987] J. O. Berger and M. Delampady. Testing precise hypotheses (with discussion). Statistical Science 2, 317-352, 1987.
[Berger and Sellke, 1987] J. O. Berger and T. Sellke. Testing a point null hypothesis: the irreconcilability of p-values and evidence. Journal of the American Statistical Association 82, 112-122, 1987.
[Casella and Berger, 1987] G. Casella and R. L. Berger. Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association 82, 106-111, 1987.
[Casella and Berger, 1990] G. Casella and R. L. Berger. Statistical Inference. Brooks/Cole, Thomson Information/Publishing Group, 1990.
[Chow and Liu, 1992] S. C. Chow and J. P. Liu. Design and Analysis of Bioavailability and Bioequivalence Studies. Marcel Dekker, Inc., New York, 1992.
[Delampady, 1989] M. Delampady. Lower bounds on Bayes factors for interval null hypotheses. Journal of the American Statistical Association 84, 120-124, 1989.
[Fieller, 1954] E. C. Fieller. Some problems in interval estimation. Journal of the Royal Statistical Society Series B 16, 175-183, 1954.
[Finney, 1978] D. J. Finney. Statistical Methods in Biological Assay, 3rd ed. Charles Griffin and Company Ltd, 1978.
[Fu, 1995] J. Fu. Personal communication, 1995.
[Govindarajulu, 1988] Z. Govindarajulu. Statistical Techniques in Bioassay. Karger, Switzerland, 1988.
[Gleser and Hwang, 1987] L. Gleser and J. T. Hwang. The nonexistence of 100(1−α)% confidence sets of finite expected diameter in errors-in-variables and related models. Annals of Statistics 15, 1351-1362, 1987.
[Gómez-Villegas and Sánchez-Manzano, 1992] M. A. Gómez-Villegas and G. E. Sánchez-Manzano. Bayes factors in testing precise hypotheses. Communications in Statistics: Theory and Methods 21, 1707-1715, 1992.
[Hwang et al., 1992] J. T. G. Hwang, G. Casella, C. Robert, M. Wells and R. Farrel. Estimation of accuracy in testing. Annals of Statistics 20, 490-509, 1992.
[Kiefer, 1977] J. Kiefer. Conditional confidence and confidence estimators (with discussion). Journal of the American Statistical Association 72, 789-827, 1977.
[Lehmann, 1986] E. L. Lehmann. Testing Statistical Hypotheses, 2nd ed. Wiley, New York, 1986.
[Lindley, 1957] D. V. Lindley. A statistical paradox. Biometrika 44, 187-192, 1957.
[Mukhopadhyay and DasGupta, 1997] S. Mukhopadhyay and A. DasGupta. Uniform approximation of Bayes solutions and posteriors: frequentistly valid Bayes inference. Statistics and Decisions 15, 51-73, 1997.
[Schafer, 1982] G. Schafer. Lindley's paradox. Journal of the American Statistical Association 77, 325-334, 1982.
[Tsao, 2006a] C. A. Tsao. A note on Lindley's paradox. Test 15, 125-139, 2006.
[Tsao, 2006b] C. A. Tsao. Assessing post-data weight of evidence. Journal of Statistical Planning and Inference 136, 4012-4025, 2006.
[Tsao, 1998] C. A. Tsao. Conditional coverage probability of confidence intervals in errors-in-variables and related models. Statistics and Probability Letters 40, 165-170, 1998.
[Verdinelli and Wasserman, 1996] I. Verdinelli and L. Wasserman. Bayes factors, nuisance parameters and imprecise tests. In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.), 765-771. Oxford University Press, 1996.
Part VIII
Statistics and Inductive Inference
STATISTICS AS INDUCTIVE INFERENCE
Jan-Willem Romeijn
1 STATISTICAL PROCEDURES AS INDUCTIVE LOGICS
An inductive logic is a system of inference that describes the relation between propositions on data, and propositions that extend beyond the data, such as predictions over future data, and general conclusions on all possible data. Statistics, on the other hand, is a mathematical discipline that describes procedures for deriving results about a population from sample data. These results include predictions on future samples, decisions on rejecting or accepting a hypothesis about the population, the determination of probability assignments over such hypotheses, the selection of a statistical model for studying the population, and so on. Both inductive logic and statistics are calculi for getting from the given data to propositions or results that transcend the data. This suggests that there is a strong parallel between statistics and inductive logic. In fact, it does not take much imagination to view statistical procedures as inferences: the input components, primarily the data, are the premises, and the result of the procedure is the conclusion. In this rough and ready way, statistical procedures can be understood as defining particular inductive logics. However, the two disciplines have evolved more or less separately. In part this is because there are objections to viewing classical statistics as inferential, although this is not true for all statistical procedures. For another part, it may be because inductive logic has been dominated by the Carnapian programme. Perhaps statisticians have not recognised inductive logic as a discipline that is much like their own. However this may be, I think it is time for a rapprochement. There are, to my mind, good reasons for investigating the parallel between inductive logic and statistics along the lines suggested above. First, framing the statistical procedures as inferences in a logic may clarify the presuppositions of these procedures. Second, by relating statistics to inductive logic, techniques and insights from inductive logic may be used to enrich statistics. And finally, showing the parallels between inductive logic and statistics may show the relevance, also to inductive logicians themselves, of their discipline to the sciences, and thereby direct further research in this field. With this aim in mind, I consider a number of statistical procedures in this chapter, and I investigate whether they can be seen as part of an inductive logic, or otherwise whether they can, at least partly, be translated into such a logic. I start by describing induction in formal terms, and I introduce a general notion of
probabilistic inductive inference. This provides a setting in which both statistical procedures and inductive logics may be captured. I subsequently discuss a number of statistical procedures, and show how they can, and cannot, be captured by certain inductive logics. The first statistical procedure is Neyman-Pearson hypothesis testing. This procedure was introduced as explicitly non-inferential, and so it should strictly speaking not be captured by an inductive logic. On the other hand, power and significance are often interpreted inferentially. At the end of the chapter I devise an inductive logic that may be used to warrant such an interpretation. The second statistical procedure is parameter estimation. I briefly discuss Fisher's theory of maximum likelihood estimators, and I show that there is a certain relation with the inductive logic developed by Carnap. A third statistical procedure is Bayesian statistics. I show that it can be captured in a probabilistic inductive logic that relates to Carnapian inductive logic via the representation theorem of de Finetti. This leads to a discussion of Bayesian statistics in relation to Bayesian inductive logic. Given the nature of the chapter, the discussion of statistical procedures is relatively short. Many procedures cannot be dealt with. Similarly, I cannot discuss in detail the many inductive logics devised within Carnapian inductive logic. For the former, the reader may consult other chapters in this volume, in particular the chapter by Festa. For the latter, I refer to [Hartmann et al., 2009], specifically the discussions of inductive logic contained therein.
2 OBSERVATIONAL DATA
As indicated, inductive inference starts from propositions on data, and ends in propositions that extend beyond the data. An example of an inductive inference is that, from the proposition that up until now all observed pears were green, we conclude that the next few pears will be green as well. Another example is that from the green pears we have seen we conclude that all pears are green, period. The key characteristic is that the conclusion says more than what is classically entailed by the premises. Let me straighten these inferences out a bit. First, I restrict attention to propositions on empirical facts, thus leaving aside such propositions as that pears are healthy, or that God made them. Second, I focus on the results of observations of particular kinds of empirical fact. For example, the empirical fact at issue is the colour of pears, and the results of the observations are therefore colours of individual pears. There can in principle be an infinity of such observation results, but what I call data is always a finite sequence of them. Third, the result of an observation is always one from a designated partition of properties, usually finite but always countable. In the pear case, it may be {red, green, yellow}. I leave aside observations that cannot be classified in terms of a mutually exclusive set of properties. I now make these ideas on what counts as data a bit more formal. The concept
I want to get across is that of a sample space, in which single observations and sequences of observations can be represented as sets, called events. After introducing the observations in terms of a language, I define sample space. All the probabilities in this chapter will be defined over sample space, because probability is axiomatized as a measure function over sets. However, the expressions may be taken as sentences from a logical language just as well. We denote the observation of individual $i$ by $Q_i$. This is a propositional variable, and we denote assignments or valuations of this variable by $q_i^k$, which represents the sentence that the result of observing individual $i$ is the property $k$. A sequence of such results of length $t$, starting at 1, is denoted with the propositional variable $S_t$, and its assignment with $s_{k_1 \ldots k_t}$, often abbreviated as $s_t$. In order to simplify notation, I denote properties with natural numbers, so $k \in K = \{0, 1, \ldots, n-1\}$. For example, if the observations are the aforementioned colours of pears, then $n = 3$. I write red as 0, green as 1, and yellow as 2, so $s_{012}$ says that the first three pears were red, green, and yellow respectively. Note further that there are logical relations among the sentences, like $s_{012} \rightarrow q_2^1$. Together, the expressions $s_t$ and $q_i^k$ form the observation language. Now we develop a set-theoretical representation of the observations, a so-called sample space, otherwise known as an observation algebra. To this aim, consider the set of all infinitely long sequences $K^\omega$, that is, all sequences like $012002010211112\ldots$, each encoding the observations of infinitely many pears. Denote such sequences with $u$, and write $u(i)$ for the $i$-th element in the sequence $u$. Every sentence $q_i^k$ can then be associated with a particular set of such sequences, namely the set of $u$ whose $i$-th element is $k$: $q_i^k = \{u \in K^\omega : u(i) = k\}$. Clearly, we can build up all finite sequences of results $s_{k_1 \ldots k_t}$ as intersections of such sets:
$$s_{k_1 \ldots k_t} = \bigcap_{i=1}^{t} q_i^{k_i}.$$
Note that entailments in the language now come out as set inclusions: we have $s_{012} \subset q_2^1$. Instead of a language with sentences $q_i^k$ and logical relations among such sentences, I will in the following employ the algebra $Q$, built up from the sets $q_i^k$ by unions and intersections. I want to emphasise that the notion of a sample space introduced here is really quite general. It excludes a continuum of individuals and a continuum of properties, but apart from that, any data recording that involves individuals and that ranges over a set of properties can serve as input. For example, instead of pears having colours we may think of subjects having test scores. Or of companies having certain stock prices. The sample space used in this chapter follows the basic structure of most applications in statistics, and of almost all applications in inductive logic.
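To make the construction concrete, here is a minimal sketch (an illustration added here, not part of the original text) that represents the events $q_i^k$ and $s_{k_1 \ldots k_t}$ as predicates on sequences, truncated to finite length so that the inclusion $s_{012} \subset q_2^1$ can be checked by enumeration:

```python
from itertools import product

K = (0, 1, 2)  # properties: 0 = red, 1 = green, 2 = yellow

def q(i, k):
    """Event q_i^k: the sequences whose i-th observation (counting from 1) is k."""
    return lambda u: u[i - 1] == k

def s(*ks):
    """Event s_{k1...kt}: the intersection of the events q_1^{k1}, ..., q_t^{kt}."""
    return lambda u: all(u[j] == k for j, k in enumerate(ks))

# Entailment as set inclusion: every length-3 sequence in s_012 is also in q_2^1.
s012, q21 = s(0, 1, 2), q(2, 1)
assert all(q21(u) for u in product(K, repeat=3) if s012(u))
```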
3 INDUCTIVE INFERENCE
Now that I have made the notion of data more precise, let me turn to inductive inference. Consider the case in which I have observed three green pears: $s_{111}$. What can I conclude about the next pear? Or about pears in general? From the structure of the data itself, it seems that we can conclude depressingly little. We might say that the next pear is green, $q_4^1$. But as it stands, each of the sets $s_{111k} = s_{111} \cap q_4^k$, for $k = 0, 1, 2$, is a member of the sample space, or in terms of the logical language, we cannot derive any sentence $q_4^k$ from the sentence $s_{111}$. The event of observing three green pears is consistent with any colour for the next pear. Purely on the basis of the classical relations among observations, as captured by the language and the sample space, we cannot draw any inductive conclusion. Perhaps we can say that given three green pears, the next pear being green is more probable? This is where we enter the domain of probabilistic inductive logic. We can describe the complete population of pears by a probability function over the observational facts, $P : Q \to [0, 1]$. Every possible pear $q_{t+1}^k$, and also every sequence of such pears $s_{k_1 \ldots k_t}$, receives a distinct probability. The probability of the next pear being of a certain colour, conditional on a given sequence, is expressed as $P(q_{t+1}^k \mid s_{k_1 \ldots k_t})$. Similarly, we may wonder about the probability that all pears are green, which is again determined by the probability assignment, in this case $P(\{\forall i : q_i^1\})$. All such probabilistic inductive inferences are determined by the full probability function $P$. The central question of any probabilistic inductive inference or procedure is therefore how to determine the function $P$, relative to the data that we already have. What must the probability of the next observation be, given a sequence of observations gone before? And what is the right, or otherwise the preferable, distribution over all observations given the sequence? Both statistics and inductive logic aim to provide an answer to these questions, but they do so in different ways. In order to facilitate the view that the statistical procedures are logical inferences, it will be convenient to keep in mind a particular understanding of probability assignments $P$ over the sample space, or observation algebra, $Q$. Recall that in classical two-valued logic, a model of the premises is a complete truth valuation over the language, subject to the rules of logic. Because of the correspondence between language and algebra, the model is also a complete function over the algebra, taking the values $\{0, 1\}$. Accordingly, the premises of some deductive logical argument are represented as a set of models over the algebra. By analogy, we may consider a probability function over an observation algebra as a model too. Just like the truth value assignment, the probability function is a function over an algebra, only it takes values in the interval $[0, 1]$, and it is subject to the axioms of probability. Probabilistic inductive logics use probability models for the purpose of inductive inference. In particular, the premises of a probabilistic inductive argument can be represented as a set, possibly a singleton, of probability assignments. But
there are widely different ways of understanding the inferential step, i.e., the step running from the premises to the conclusion. The most straightforward of these, and the one that is closest to classical statistical practice, is to associate a probability function $P$, or otherwise a set of such functions, with each sample $s_t$. The inferential step then runs from the data $s_t$ and a large set of probability functions $P$, possibly all conceivable functions, towards a more restricted set, or even towards a single $P$. The resulting inductive logic is called ampliative, because the restriction on the set of probability functions that is effected by the data, i.e. the conclusion, is often stronger than what follows from the data and the initial set of probability functions, i.e. the premises, by deduction. We can also make the inferential step precise by analogy to a more classical, non-ampliative notion of entailment. As will become apparent, this kind of inferential step is more naturally associated with what is traditionally called inductive logic. It is also associated with a basic kind of probabilistic logic, as elaborated in [Hailperin, 1996] and more recently in [Haenni et al., 2009], especially section 2. Finally, this kind of inference is strongly related to Bayesian logic, as advocated by [Howson, 2003]. It is the kind of inductive logic favored in this chapter. An argument is said to be classically valid if and only if the set of models satisfying the premises is contained in the set of models satisfying the conclusion. The same idea of classical entailment may now be applied to the probabilistic models over sample space. In that case, the inferential step is from one set of probability assignments, characterised by a number of restrictions associated with premises, towards another set of probability assignments, characterised by a different restriction that is associated with a conclusion. The inductive inference is called valid if the former is contained in the latter, i.e., if every model satisfying the premises is also a model satisfying the conclusions. In such a valid inferential step, the conclusion does not amplify the premises. As an example, say that we fix $P(q_1^0) = \frac{1}{2}$ and $P(q_1^1) = \frac{1}{3}$. Both these probability assignments can be taken as premises in a logical argument, and the models of these premises are simply all probability functions $P$ over $Q$ for which these two valuations hold. By the axioms of probability, we can derive that any such function $P$ will also satisfy $P(q_1^2) = \frac{1}{6}$, and hence also that $P(q_1^2) < \frac{1}{4}$. On its own, the latter expression amounts to a set of probability functions over the sample space $Q$ in which the probability functions that satisfy both premises are included. In other words, the latter assignment is classically entailed by the two premises. Along exactly the same lines, we may derive a probability assignment for a statistical hypothesis $h$ conditional on the data $s_t$, written as $P(h \mid s_t)$, from the input probabilities $P(h)$, $P(s_t)$, and $P(s_t \mid h)$, using the theorem of Bayes. The classical, non-ampliative understanding of entailment may thus be used to reason inductively, towards predictions and statistical hypotheses that themselves determine a probability assignment over data. In the following I will focus primarily on such non-ampliative inductive logical inferences to investigate statistical procedures.
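The little example with two premises can be spelled out in exact arithmetic; the following sketch (an added illustration) simply applies the additivity axiom to derive the entailed conclusions:

```python
from fractions import Fraction

# Premises: P(q_1^0) = 1/2 and P(q_1^1) = 1/3.
p_q0, p_q1 = Fraction(1, 2), Fraction(1, 3)

# The three properties partition the outcomes, so the probabilities sum to one;
# every probability model of the premises therefore satisfies the conclusions.
p_q2 = 1 - p_q0 - p_q1
assert p_q2 == Fraction(1, 6) and p_q2 < Fraction(1, 4)
```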
I first discuss Neyman-Pearson hypothesis testing and Fisher’s maximum likelihood estimation in their own terms, showing that they are best understood
as ampliative inductive inferences. Then I discuss Carnapian inductive logic and show that it can be viewed as a non-ampliative version of parameter estimation. This leads to a discussion of Bayesian statistical inference, which is subsequently related to a generalisation of Carnapian inductive logic, Bayesian inductive logic. The chapter ends with an application of this logic to Neyman-Pearson hypothesis testing. As indicated, Carnapian inductive logic is most easily related to non-ampliative logic. So, viewing statistical procedures in this perspective makes the latter more amenable to inductive logical analysis. But I do not want to claim that I thereby lay bare the real nature of the statistical procedures. Rather, I hope to show that the investigation of statistics along these specific logical lines clarifies and enriches statistical procedures. Furthermore, as indicated, I hope to stimulate research in inductive logic that is directed at problems in statistics.
4 NEYMAN-PEARSON TESTING
The first statistical application concerns the choice between two statistical hypotheses, that is, two fully specified probability functions over sample space. In the above vocabulary, it concerns the choice between two probabilistic models, but we must be careful with our words here, because in statistics, models often refer to sets of statistical hypotheses. In the following, I will therefore refer to complete probability functions over the algebra as hypotheses. Let $H = \{h_0, h_1\}$ be the set of hypotheses, and let $Q$ be the sample space, that is, the observation algebra introduced earlier on. We can compare the hypotheses $h_0$ and $h_1$ by means of a Neyman-Pearson test function. See [Barnett, 1999] and [Neyman and Pearson, 1967] for the details.

DEFINITION 1 Neyman-Pearson Hypothesis Test. Let $F$ be a function over the sample space $Q$,
$$F(s_t) = \begin{cases} 1 & \text{if } \frac{P_{h_1}(s_t)}{P_{h_0}(s_t)} > r, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $P_{h_j}$ is the probability over the sample space determined by the statistical hypothesis $h_j$. If $F = 1$ we decide to reject the null hypothesis $h_0$, else we accept $h_0$ for the time being.

Note that, in this simplified setting, the test function is defined for each set of sequences $s_t$ separately. For each sample plan, and associated sample size $t$, we must define a separate test function. The decision to accept or reject a hypothesis is associated with the so-called
significance and power of the test:
$$\mathrm{Significance}_F = \alpha = \int_Q F(s_t)\, P_{h_0}(s_t)\, ds_t, \qquad \mathrm{Power}_F = 1 - \beta = \int_Q F(s_t)\, P_{h_1}(s_t)\, ds_t.$$
The significance is the probability, according to the hypothesis $h_0$, of obtaining data that leads us to reject the hypothesis $h_0$, or in short, the type-I error of falsely rejecting the null hypothesis, denoted $\alpha$. Similarly, the power is the probability, according to $h_1$, of obtaining data that leads us to reject the hypothesis $h_0$, or in short, the probability under $h_1$ of correctly rejecting the null hypothesis, so that $\beta = 1 - \mathrm{Power}$ is the type-II error of falsely accepting the null hypothesis. An optimal test is one that minimizes the significance level, and maximizes the power. Neyman and Pearson prove that the decision has optimal significance and power for, and only for, likelihood-ratio test functions $F$. That is, an optimal test depends only on a threshold for the ratio $\frac{P_{h_1}(s_t)}{P_{h_0}(s_t)}$. Let me illustrate the idea of Neyman-Pearson tests. Say that we have a pear whose colour is described by $q^k$, and we want to know from what farm it originates, from farmer Maria ($h_0$) or Lisa ($h_1$). We know that the colour compositions of the pears from the two farms are different:

Hypothesis \ Data    $q^0$    $q^1$    $q^2$
$h_0$                0.00     0.05     0.95
$h_1$                0.40     0.30     0.30
If we want to decide between the two hypotheses, we need to fix a test function. Say that we choose
$$F(q^k) = \begin{cases} 0 & \text{if } k = 2, \\ 1 & \text{else.} \end{cases}$$
In the definition above, which uses a threshold for the likelihood ratio, this comes down to choosing a value for $r$ somewhere between $\frac{6}{19}$ and $6$, for example $r = 1$. The significance level is $P_{h_0}(q^0 \cup q^1) = 0.05$, and the power is $P_{h_1}(q^0 \cup q^1) = 0.70$. Now say that the pear we have is green, so $F = 1$ and we reject the null hypothesis, concluding, with the aforementioned significance and power, that Maria did not grow the pear. Note that from the perspective of ampliative inductive logic, it is not too far-fetched to read an inferential step into the Neyman-Pearson procedure. The test function $F$ brings us from a sample $s_t$ and two probability functions, $P_{h_j}$ for $j = 0, 1$, to a single probability function $P_{h_1}$, or $P_{h_0}$, over the sample space $Q$. So we might say that the test function is the procedural analogue of an inductive inferential step, as discussed in Section 3.
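As an aside, the numbers in this example are easy to replicate; the sketch below (an added illustration, not part of the original text) implements the likelihood-ratio test of Definition 1 for the table above:

```python
# Likelihoods of each colour (0 = red, 1 = green, 2 = yellow) under the two farms.
P_h0 = {0: 0.00, 1: 0.05, 2: 0.95}   # Maria, the null hypothesis
P_h1 = {0: 0.40, 1: 0.30, 2: 0.30}   # Lisa, the alternative

r = 1.0  # threshold for the likelihood ratio in Equation (1)

def F(k):
    """Reject h_0 (return 1) when the likelihood ratio P_h1/P_h0 exceeds r."""
    ratio = float("inf") if P_h0[k] == 0 else P_h1[k] / P_h0[k]
    return 1 if ratio > r else 0

alpha = sum(p for k, p in P_h0.items() if F(k) == 1)   # significance: 0.05
power = sum(p for k, p in P_h1.items() if F(k) == 1)   # power: 0.70
print(alpha, power)
```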
This inferential step is ampliative because both probability functions $P_{h_j}$ are consistent with the data. Ruling out one of them cannot be done deductively.¹ Neyman-Pearson hypothesis testing is sometimes criticised because its results generally depend on the probability function over the entire sample space, and not just on the probability of those elements in sample space corresponding to the actual events, the observed sample for short. That is, the decision to accept or reject the null hypothesis against some alternative hypothesis depends not just on the probability of what has actually been observed, but also on the probability assignment over everything that could have been observed. A well-known illustration of this problem concerns so-called optional stopping. But here I want to illustrate the same point with an example that can be traced back to [Jeffreys, 1931] p. 357, and of which a variant is discussed in [Hacking, 1965].² Instead of the hypotheses $h_0$ and $h_1$ above, say that we compare the hypotheses $h_0^\star$ and $h_1$.
Hypothesis \ Data    $q^0$    $q^1$    $q^2$
$h_0^\star$          0.05     0.05     0.90
$h_1$                0.40     0.30     0.30
We determine the test function $F(q^k) = 1$ iff $k = 0$, by requiring the same significance level, $P_{h_0^\star}(q^0) = 0.05$, resulting in the power $P_{h_1}(q^0) = 0.40$. Now imagine that we observe $q^1$ again. Then we accept $h_0^\star$. But this is a bit odd, because the hypotheses $h_0$ and $h_0^\star$ have the same probability for $q^1$! So how can the two test procedures react differently to this observation? It seems that, in contrast to $h_0$, the hypothesis $h_0^\star$ escapes rejection because it allocates some probability to $q^0$, an event that does not occur. This causes a shift in the area within sample space on which the hypothesis $h_0^\star$ is rejected. This phenomenon gave rise to the famed complaint of Jeffreys that “a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred”: if indeed $h_0$ is true, it will be rejected after observing $q^1$, simply because it gives $q^0$ a zero probability. This illustrates how the results of a Neyman-Pearson procedure depend on the whole probability function that a hypothesis defines over the sample space, and
not just on the probability defined for the actual observation. From the perspective of an inductive logician, it may therefore seem “a remarkable procedure”, to cite Jeffreys again. But it must be emphasised that Neyman-Pearson statistics was never intended as an inference in disguise. It is a procedure that allows us to decide between two hypotheses on the basis of data, generating error rates associated with that decision. Neyman and Pearson themselves were very explicit that the procedure must not be interpreted inferentially. Rather than inquiring into the truth and falsity of a hypothesis, they were interested in the probability of mistakenly deciding to reject or accept a hypothesis. The significance and power concern the probability over data given a hypothesis, not the probability of hypotheses given the data.³

¹There are attempts to make these ampliative inferences more precise, by means of a form of default reasoning, or a reasoning that employs a preferential ordering over probability models. Specifically, so-called evidential probability, proposed by [Kyburg, 1974] and more recently discussed by [Wheeler, 2006], is concerned with inferences that combine statistical hypotheses, which are each accepted with certain significance levels. However, in this chapter I will not investigate these logics. They are not concerned with inferences from the data to predictions or to hypotheses, but rather with inferences from hypotheses to other hypotheses, and from hypotheses to predictions.
²I would like to thank Jos Uffink for bringing this example to my attention. As far as I can see, the exact formulation of the example is his.
³This is not to say that Neyman-Pearson statistics cannot be viewed from an inferential angle. See Section 9 for an inferential account.
5 FISHER'S PARAMETER ESTIMATION
Let me turn to another important classical statistical procedure, so-called parameter estimation. I focus in particular on an estimation procedure first devised by Fisher, estimation by maximum likelihood. The maximum likelihood estimator determines the best among a much larger, possibly infinite, set of hypotheses. Again it depends entirely on the probability that the hypotheses assign to points in the sample space. See [Barnett, 1999] and [Fisher, 1956] for more detail.

DEFINITION 2 Maximum Likelihood Estimation. Let $H = \{h_\theta : \theta \in \Theta\}$ be a set of hypotheses, labeled by the parameter $\theta$, and let $Q$ be the sample space. Then the maximum likelihood estimator of $\theta$,
$$\hat{\theta}(s_t) = \{\theta : \forall h_{\theta'}\ P_{h_{\theta'}}(s_t) \le P_{h_\theta}(s_t)\}, \qquad (2)$$
$\hat{\theta}$ for short, is a function over the elements $s_t$ in the sample space. So the estimator is a set, typically a singleton, of those values of $\theta$ for which the likelihood of $h_\theta$ on the data $s_t$ is maximal. The associated best hypothesis we denote with $h_{\hat{\theta}}$.

Note that this estimator is a function over the sample space, associating each $s_t$ with a hypothesis, or a set of them. Often the estimation is coupled to a so-called confidence interval. Restricting the parameter space to $\Theta = [0, 1]$ for convenience, and assuming that the true value is $\theta$, we can define a region in sample space within which the estimator function is not too far off the mark. Specifically, we might set the region in such a way that it covers $1 - \epsilon$ of the probability $P_{h_\theta}$:
$$\int_{\theta - \Delta}^{\theta + \Delta} P_{h_\theta}(\hat{\theta})\, d\hat{\theta} = 1 - \epsilon.$$
We can provide an unproblematic frequentist interpretation of the interval $\hat{\theta} \in [\theta - \Delta, \theta + \Delta]$: in a series of estimations, the fraction of times in which the estimator $\hat{\theta}$ is further off the mark than $\Delta$ will tend to $\epsilon$. The smaller the region, the more reliable the estimate. Note, however, that this interval is defined in terms of the unknown true value $\theta$. Some applications allow for the derivation of a region of parameter values within which the true value $\theta$ can be expected to lie.⁴ The general idea is to define a set of parameter values $R$ within which the data are not too unlikely, $R(s_t) = \{\theta : P_{h_\theta}(s_t) > \epsilon\}$ for some small value $\epsilon > 0$. Now in terms of the integral above, we can swap the roles of $\theta$ and $\hat{\theta}$ and define the so-called central confidence interval:
$$\mathrm{Conf}_{1-\epsilon}(\hat{\theta}) = \left\{ \theta : |\theta - \hat{\theta}| < \Delta \ \text{and}\ \int_{\hat{\theta} - \Delta}^{\hat{\theta} + \Delta} P_{h_\theta}(\hat{\theta})\, d\theta = 1 - \epsilon \right\}.$$
Via the function $\hat{\theta}(s_t)$, every element of the sample space $s_t$ is assigned a region $\mathrm{Conf}_{1-\epsilon}$ of parameter values, interpreted as the region within which we may expect to find the true value $\theta$. Note, however, that swapping the roles of $\theta$ and $\hat{\theta}$ in the integral is not unproblematic. We can only interpret the integral as a probability if $P_{h_\theta}(\hat{\theta} + \delta) = P_{h_{\theta - \delta}}(\hat{\theta})$ for all values of $\delta$, or in other words, if for fixed $\hat{\theta}$ the function $P_{h_\theta}(\hat{\theta})$ is indeed a probability density over $\theta$. In other cases, the interval cannot be taken as expressing the expected accuracy of the estimate, or at least not without further critical reflection. Let me illustrate parameter estimation in a simple example on pears again. Say that we are interested in the colour composition of pears from Emma's farm, and that her pears are red, $q_i^0$, or green, $q_i^1$. Any ratio between these two kinds of pears is possible, so we have a set of hypotheses $h_\theta$, called multinomial hypotheses, for which
$$P_{h_\theta}(q_t^1 \mid s_{t-1}) = \theta, \qquad P_{h_\theta}(q_t^0 \mid s_{t-1}) = 1 - \theta \qquad (3)$$
with $\theta \in [0, 1]$. The hypothesis $h_\theta$ fixes the proportion of green pears at $\theta$, and therefore, independently of what pears we saw before, on the assumption of the hypothesis $h_\theta$ the probability that a randomly drawn pear from Emma's farm is green is $\theta$. The type of distribution over $Q$ that is induced by these hypotheses is called a Bernoulli distribution, or a multinomial distribution.

⁴The determination of such regions is similar in nature to the determination of so-called fiducial probability. [Fisher, 1930; Fisher, 1935; Fisher, 1956] developed the notion of fiducial probability as a way of capturing parameter estimation in terms of a non-ampliative entailment relation, basically deriving a probability assignment over hypotheses without assuming a prior probability over statistical hypotheses at the outset. The fiducial argument is controversial, however, and its applicability is limited to particular statistical problems. [Seidenfeld, 1979] provides a detailed discussion of the restricted applicability of the argument in cases with multiple parameters. [Dawid and Stone, 1982] argue that in order to run the fiducial argument, one has to assume that the statistical problem can be captured in a functional model that is smoothly invertible. In this chapter, I will not discuss the fiducial argument and explore another non-ampliative representation instead.
The idea of Fisher's maximum likelihood estimation is that we choose the value of $\theta$ for which the probability that the hypothesis $h_\theta$ gives to the data $s_{k_1 \ldots k_t}$ is maximal. Say that we have observed a sequence of pears $s_{000101}$. The probability of these data given the hypothesis $h_\theta$ is
$$P_{h_\theta}(s_{000101}) = \prod_{i=1}^{t} P_{h_\theta}(q_i^{k_i} \mid s_{i-1}) = \theta^2 (1 - \theta)^4. \qquad (4)$$
Note that the probability of the data only depends on the number of 0's and the number of 1's in the sequence. Now the above likelihood function is maximal at $\theta = \frac{1}{3}$, so $\hat{\theta} = \frac{1}{3}$. More generally, defining $t_1$ as the number of 1's in the sequence $s_t$, the maximum likelihood estimator is
$$\hat{\theta}(s_t) = \frac{t_1}{t}. \qquad (5)$$
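A small numerical check (an added illustration, assuming NumPy is available) confirms that maximising the likelihood of Equation (4) over a grid of candidate values recovers the closed form of Equation (5):

```python
import numpy as np

seq = [0, 0, 0, 1, 0, 1]            # the observed pears, s_000101
t, t1 = len(seq), sum(seq)

def likelihood(theta):
    """P_h_theta(s_t) = theta^t1 * (1 - theta)^t0, as in Equation (4)."""
    return theta**t1 * (1 - theta)**(t - t1)

grid = np.linspace(0.0, 1.0, 100001)
theta_hat = grid[np.argmax(likelihood(grid))]
print(theta_hat, t1 / t)            # both approximately 1/3
```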
Note finally that for a true value $\theta$, the probability of finding the estimate in the interval $\frac{t_1}{t} \in [\theta - \Delta, \theta + \Delta]$ increases for larger data sequences. Fixing the probability at $1 - \epsilon$, the size of the interval will therefore decrease with increasing sample size. This completes the introduction to parameter estimation. The thing to note is that the statistical procedure can be taken as the procedural analogue of an ampliative logical inference, running from the data to a probability assignment over the sample space. We have $H$ as the set of probability models from which the inference starts, and by means of the data we then choose a single $h_{\hat{\theta}}$ of these, or a set $C_{95}$, as our conclusion. In the following I aim to investigate whether there is a non-ampliative logical representation of this inductive inference.
6 CARNAPIAN LOGICS
A straightforward way of capturing parameter estimation in a logic is by relating it to the logic of induction developed by [Carnap, 1950; Carnap, 1952]. Historically, Carnapian inductive logic can lay most claim to the title of inductive logic proper. It was the first systematic study into probabilistic predictions on the basis of data. The central concept in Carnapian inductive logic is logical probability. Recall that the sample space $Q$, also called the observation algebra, corresponds to an observation language, comprising sentences such as “the second pear is green”, or formally, $q_2^1$. The original idea of Carnap was to derive a probability assignment over the language on the basis of symmetries within the language. In the example, we have three mutually exclusive properties for each pear, and in the absence of any further knowledge, there is no reason to think of any of these properties as special or as more, or less, appropriate than the other two. The symmetry inherent to the language suggests that each of the sentences $q_i^k$ for $k = 0, 1, 2$ should get equal probability:
$$P(q_i^0) = P(q_i^1) = P(q_i^2) = \frac{1}{3}.$$
The idea of logical probability is to fix a unique probability function over the observation language, or otherwise a strongly restricted set of such functions, on the basis of symmetries. Next to symmetries, the set of probability functions can also be restricted by certain predictive properties. As an example, we may feel that yellow pears are more akin to green pears, so that finding a yellow pear decreases the probability for red pears considerably, while it decreases the probability for green pears much less dramatically. That is,
$$\frac{P(q_{t+1}^1 \mid s_{t-1} \cap q_t^2)}{P(q_{t+1}^0 \mid s_{t-1} \cap q_t^2)} > \frac{P(q_{t+1}^1 \mid s_{t-1})}{P(q_{t+1}^0 \mid s_{t-1})}.$$
How such relations among properties may play a part in determining the probability assignment $P$ is described in the literature on analogy reasoning. See [Festa, 1996; Maher, 2000; Romeijn, 2006]. Interesting recent work on relations between predictive properties in the context of analogical predictions can also be found in [Paris and Waterhouse, 2008]. Any Carnapian inductive logic is defined by a number of symmetry principles and predictive properties, determining a probability function, or a set of such functions. One very well-known inductive logic, discussed at length in [Carnap, 1952], employs a probability assignment characterised by the following symmetries,
$$P(q_i^k) = P(q_i^{k'}), \qquad P(s_{k_1 \ldots k_i \ldots k_t}) = P(s_{k_i \ldots k_1 \ldots k_t}), \qquad (6)$$
for all values of $i$, $t$, $k$, and $k'$, and for all values $k_i$ with $1 \le i \le t$. The latter of these is known as the exchangeability of observations: the order in the observations does not matter to their probability. The inductive logic at issue employs a particular version of exchangeability, known as the requirement of restricted relevance,
$$P(q_{t+1}^k \mid s_t) = f(t_k, t), \qquad (7)$$
where $t_k$ is the number of earlier instances $q_i^k$ in the sequence $s_t$ and $t$ the total number of observations. Together these symmetries and predictive properties determine a particular set of probability assignments $P$, for which we can derive the following consequence:
$$P(q_{t+1}^k \mid s_t) = \frac{t_k + \frac{\lambda}{n}}{t + \lambda}, \qquad (8)$$
where $n$ is the number of values for $k$. The parameter $0 \le \lambda \le \infty$ can be chosen at will. Predictive probability assignments of this form are called Carnapian $\lambda$-rules. The probability distribution in Equation (8) has some striking features. Most importantly, for any of the probability functions $P$ satisfying the aforementioned symmetries, we have that
$$P(q_{t+1}^k \mid s_{t-1} \cap q_t^k) > P(q_{t+1}^k \mid s_{t-1}).$$
This predictive property is called instantial relevance: the occurrence of $q_t^k$ typically increases the probability for $q_{t+1}^k$. It was a success for Carnap that this inductive effect is derivable from the symmetries alone. By providing an independent justification for these symmetries, Carnap effectively provided a justification for induction, thereby answering the age-old challenge of Hume.⁵ Note that the outlook of Carnapian logic is very different from the outlook of classical statistical procedures, like Fisher's parameter estimation or Neyman-Pearson testing. Classical statistics starts with statistical hypotheses, each associated with a probability function over a sample space, and then chooses the best fitting one on the basis of the data. By contrast, Carnapian logic starts with a sample space and a number of symmetry principles and predictive properties, that together fix a set of probability functions over the sample space. Just like the truth tables restrict the possible truth valuations, so do these principles restrict the logical probability functions, albeit not to a singleton, as $\lambda$ can still be chosen freely. But from the point of view of statistics, Carnap is thereby motivating, from logical principles, the choice for a particular set of hypotheses. Recall that classical statistics was naturally associated with ampliative inductive inference. By contrast, if we ignore the notion of logical probability and concentrate on the inferential step, Carnapian inductive logics fall very neatly within the template for non-ampliative inductive logic that I laid down at the beginning. By means of a number of symmetry principles and predictive properties, we fix a set of probability assignments over the sample space. The conclusions are then reached by working out specific consequences for probability functions within this set, using the axioms of probability. In particular, Carnapian inductive logic looks at the probability assignments conditional on various samples $s_t$, deriving that they all satisfy instantial relevance, for example. Importantly, the symmetries in the language appear as premises in the inductive logical inference. They restrict the set of probability assignments that is considered in the inference. Despite these differences in outlook, ampliative against non-ampliative, we can identify a strong similarity between parameter estimation, as discussed in Section 5, and the predictive systems of Carnapian logic. To see this, note that the procedure of parameter estimation can be used to determine the probability of the next piece of data. In the example on pears, once we have observed $s_{000101}$ and thus chosen $h_{\frac{1}{3}}$, we may on the basis of that predict that the next pear has a probability of $\frac{1}{3}$ to be green. In other words, the function $\hat{\theta}$ is a predictive system, much like any other Carnapian inductive logic. We can write
$$P(q_{t+1}^k \mid s_t) = P_{h_{\hat{\theta}(s_t)}}(q_{t+1}^k).$$
The estimation function $\hat{\theta}$ by Fisher is thus captured in a single probability function $P$.
The estimation function θˆ by Fisher is thus captured in a single probability function 5 As recounted in [Zabell, 1982], earlier work that connects exchangeability to the predictive properties of probability functions was done by [Johnson, 1932] and [de Finetti, 1937]. But the specific relation with Hume’s problem noted here is due to Carnap: he motivated predictive properties such as Equation (8) independently, by the definition of logical probability, whereas for the subjectivist de Finetti these properties did not have any objective grounding.
So we can present the latter as a probability assignment over sample space, from which estimations can be derived by a non-ampliative inference. Let me make this concrete by means of the example on red and green pears. In the Carnapian prediction rule of Equation (8), choosing $\lambda = 0$ will yield the observed relative frequencies as predictions. And according to Equation (5) these relative frequencies are also the maximum likelihood estimators. Thus, for each set of possible observations, $\{s_{k_1 \ldots k_t} : k_i = 0, 1\}$, the Carnapian rule with $\lambda = 0$ predicts according to the Fisherian estimate.⁶ Unfortunately the alignment of Fisher estimation and Carnapian inductive logic is rather problematic. Already for estimations for multinomial hypotheses, it is not immediate how we can define the corresponding probability assignment over sample space, and whether we thereby define a coherent probability function at all. For more complicated sets of hypotheses, and the more complicated estimators associated with them, the corresponding probability assignment $P$ may be even less natural, or possibly incoherent. Moreover, the principles and predictive properties that may motivate the choice of that probability function will be very hard to come by. In the following I will therefore not discuss the further intricacies of capturing Fisher's estimation functions by Carnapian prediction rules. However, Carnapian rules will make a reappearance in the next two sections, because in a much more straightforward sense, they are the predictive counterpart to Bayesian statistics.

⁶Note that the probability function $P$ that describes the estimations is a rather unusual one. After three green pears for example, $s_{111}$, the probability for the next pear to be red will be 0, so that $P(s_{1110}) = 0$. By the standard axiomatisation and definitions of probability, the probability of any observation $q_5^0$ conditional on $s_{1110}$ is not defined. But if the probability function $P$ is supposed to follow the Fisherian estimations, then we must have $P(q_5^0 \mid s_{1110}) = \frac{1}{4}$. To accommodate the probability function imposed by Fisher's estimations, we may change the axiomatisation of probability. In particular, we may adopt an axiomatisation in which conditional probability is primitive, as described in [Rényi, 1970]. Alternatively, we can restrict ourselves to estimations based on the observation of more than one property.
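Returning to the prediction rule itself, Equation (8) is simple enough to state directly in code; the sketch below (an added illustration) shows a Carnapian prediction and the limiting behaviour in which $\lambda \to 0$ recovers the relative-frequency, i.e. maximum likelihood, prediction:

```python
def carnap_prediction(seq, k, n=2, lam=2.0):
    """Carnapian lambda-rule, Equation (8): (t_k + lam/n) / (t + lam)."""
    t, t_k = len(seq), seq.count(k)
    return (t_k + lam / n) / (t + lam)

seq = [0, 0, 0, 1, 0, 1]                    # s_000101
print(carnap_prediction(seq, 1, lam=2.0))   # 0.375
print(carnap_prediction(seq, 1, lam=1e-9))  # ~0.3333: lambda -> 0 gives t_1/t
```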
7 BAYESIAN STATISTICS
The defining characteristic of Bayesian statistics is that probability assignments do not just range over data, but that they can also take statistical hypotheses as arguments. As will be seen in the following, Bayesian inference is naturally represented in terms of a non-ampliative inductive logic, and it also relates very naturally to Carnapian inductive logic. Let $H$ be the space of statistical hypotheses $h_\theta$, and let $Q$ be the sample space as before. The functions $P$ are probability assignments over the entire space $H \times Q$. Since $h_\theta$ is a member of the combined algebra, it makes sense to write $P(s_t \mid h_\theta)$ instead of the $P_{h_\theta}(s_t)$ written in the context of classical statistics. We can define Bayesian statistics as follows.
DEFINITION 3 Bayesian Statistical Inference. Assume the prior probability $P(h_\theta)$ assigned to hypotheses $h_\theta \in H$, with $\theta \in \Theta$, the space of parameter values. Further assume $P(s_t \mid h_\theta)$, the probability assigned to the data $s_t$ conditional on the hypotheses, called the likelihoods. Bayes' theorem determines that
$$P(h_\theta \mid s_t) = P(h_\theta)\, \frac{P(s_t \mid h_\theta)}{P(s_t)}. \qquad (9)$$
Bayesian statistics outputs the posterior probability assignment, $P(h_\theta \mid s_t)$.
See [Barnett, 1999] and [Press, 2003] for a more detailed discussion. The further results from a Bayesian inference, such as estimations and measures for the accuracy of the estimations, can all be derived from the posterior distribution over the statistical hypotheses. In this definition the probability of the data $P(s_t)$ is not presupposed, because it can be computed from the prior and the likelihoods by the law of total probability,
$$P(s_t) = \int_\Theta P(h_\theta)\, P(s_t \mid h_\theta)\, d\theta.$$
The result of a Bayesian statistical inference is not always a posterior probability. Often the interest is only in comparing the ratio of the posteriors of two hypotheses. By Bayes' theorem we have
$$\frac{P(h_\theta \mid s_t)}{P(h_{\theta'} \mid s_t)} = \frac{P(h_\theta)\, P(s_t \mid h_\theta)}{P(h_{\theta'})\, P(s_t \mid h_{\theta'})},$$
and if we assume equal priors $P(h_\theta) = P(h_{\theta'})$, we can use the ratio of the likelihoods of the hypotheses, the so-called Bayes factor, to compare the hypotheses. Let me give an example of a Bayesian procedure. Consider the hypotheses of Equation (3), concerning the fraction of green pears in Emma's orchard. Instead of choosing among them on the basis of the data, assign a so-called Beta-distribution over the range of hypotheses,
$$P(h_\theta) \propto \theta^{\lambda/2 - 1} (1 - \theta)^{\lambda/2 - 1} \qquad (10)$$
with $\theta \in \Theta = [0, 1]$. For $\lambda = 2$, this function is uniform over the domain. Now say that we obtain a certain sequence of pears, $s_{000101}$. By the likelihood of the hypotheses as given in Equation (4), we can derive $P(h_\theta \mid s_{000101}) \propto \theta^{\lambda/2 + 1} (1 - \theta)^{\lambda/2 + 3}$. More generally, the likelihood function for the data $s_t$ with numbers $t_k$ of earlier instances $q_i^k$ is $\theta^{t_1} (1 - \theta)^{t_0}$, so that
$$P(h_\theta \mid s_t) \propto \theta^{\lambda/2 - 1 + t_1} (1 - \theta)^{\lambda/2 - 1 + t_0} \qquad (11)$$
is the posterior distribution over the hypotheses. This posterior is derived by the axioms of probability theory alone, specifically by Bayes’ theorem.
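Since the Beta family is conjugate to these likelihoods, the posterior of Equation (11) is again a Beta distribution, and its moments are available in closed form; the following sketch (an added illustration) computes the posterior expectation for the pear data:

```python
lam = 2.0                      # lambda = 2 gives the uniform prior of Equation (10)
seq = [0, 0, 0, 1, 0, 1]       # s_000101
t1, t0 = sum(seq), len(seq) - sum(seq)

# Equation (11): the posterior is Beta(lam/2 + t1, lam/2 + t0),
# with expectation a / (a + b).
a, b = lam / 2 + t1, lam / 2 + t0
print(a / (a + b))             # posterior expectation E[theta] = 3/8
```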
As said, capturing this statistical procedure in a non-ampliative inference is relatively straightforward. The premises are the prior over the hypotheses, $P(h_\theta)$ for $\theta \in \Theta$, and the likelihood functions, $P(s_t \mid h_\theta)$ over the algebras $Q$, which are determined for each hypothesis $h_\theta$ separately. These premises are such that only a single probability assignment over the space $H \times Q$ remains. In other words, the premises have a unique probability model. Moreover, all the conclusions are straightforward consequences of this probability assignment. They can be derived from the assignment by applying theorems of probability theory, primarily Bayes' theorem. Before turning to the relation of Bayesian inference with Carnapian logic, let me compare it to the classical procedures sketched in the foregoing. In all cases, we consider a set of statistical hypotheses, and in all cases our choice among these is informed by the probability of the data according to the hypotheses. The difference is that in the two classical procedures, this choice is absolute: acceptance, rejection, and the appointment of a best estimate. In the Bayesian procedure, by contrast, all this is expressed in a posterior probability assignment over the set of hypotheses. Note that this posterior over hypotheses can be used to generate the kind of choices between hypotheses that classical statistics provides. Consider Fisherian parameter estimation. We can use the posterior to derive an expectation for the parameter $\theta$, as follows:
$$E[\theta] = \int_\Theta \theta\, P(h_\theta \mid s_t)\, d\theta. \qquad (12)$$
Clearly, $E[\theta]$ is a function that brings us from the hypotheses $h_\theta$ and the data $s_t$ to a preferred value for the parameter. The function depends on the prior probability over the hypotheses, but it is in a sense analogous to the maximum likelihood estimator. In analogy to the confidence interval, we can also define a so-called credal interval from the posterior probability distribution:
$$\mathrm{Cred}_{1-\epsilon} = \left\{ \theta : |\theta - E[\theta]| < d \ \text{and}\ \int_{E[\theta]-d}^{E[\theta]+d} P(h_\theta \mid s_t)\, d\theta = 1 - \epsilon \right\}.$$
This set of values for $\theta$ is such that the posterior probabilities of the corresponding $h_\theta$ jointly add up to $1 - \epsilon$ of the total posterior probability. Most of the controversy over the Bayesian method concerns the determination and interpretation of the probability assignment over hypotheses. As for interpretation, classical statistics objects to the whole idea of assigning probabilities to hypotheses. The data have a well-defined probability, because they consist of repeatable events, and so we can interpret the probabilities as frequencies, or as some other kind of objective probability. But the probability assigned to a hypothesis cannot be understood in this way, and instead expresses an epistemic state of uncertainty. One of the distinctive features of classical statistics is that it rejects such epistemic probability assignments, and that it restricts itself to a straightforward interpretation of probability as relative frequency.
Even if we buy into this interpretation of probability as epistemic uncertainty, how do we determine a prior probability? At the outset we do not have any idea of which hypothesis is right, or even which hypothesis is a good candidate. So how are we supposed to assign a prior probability to the hypotheses? The literature proposes several objective criteria for filling in the priors, for instance by maximum entropy or by other versions of the principle of indifference, but something of the subjectivity of the starting point remains. The strength of the classical statistical procedures is that they do not need any such subjective prior probability.
8 BAYESIAN INDUCTIVE LOGIC
While Bayesian statistics differs strongly from classical statistics, it is much more closely related to the inductive logic of Carnap. In this section I will elaborate on this relation, and indicate how Bayesian statistical inference and inductive logic may have a fruitful common future. To see how Bayesian statistics and Carnapian inductive logic hang together, note first that the result of a Bayesian statistical inference, namely a posterior, is naturally translated into the result of a Carnapian inductive logic, namely a prediction,
$$P(q_{t+1}^1 \mid s_t) = \int_0^1 P(q_{t+1}^1 \mid h_\theta \cap s_t)\, P(h_\theta \mid s_t)\, d\theta, \qquad (13)$$
by the law of total probability. Furthermore, consider the posterior probability over multinomial hypotheses. Recall that the parameter $\theta$ is the probability for the next pear to be green, as defined in Equation (3). By Equations (12) and (13) we have
$$E[\theta] = \int_\Theta \theta\, P(h_\theta \mid s_t)\, d\theta = \int_0^1 P(q_{t+1}^1 \mid h_\theta \cap s_t)\, P(h_\theta \mid s_t)\, d\theta = P(q_{t+1}^1 \mid s_t).$$
This shows that in the case of multinomial statistical hypotheses, the expectation value for the parameter is the same as a predictive probability. The correspondence between Bayesian statistics and Carnapian inductive logic is in fact even more striking. We can work out the integral of Equation (13), using Equation (10) as the prior and the multinomial hypotheses defined in Equation (3), to obtain
$$P(q_{t+1}^1 \mid s_t) = \frac{t_1 + \frac{\lambda}{2}}{t + \lambda}. \qquad (14)$$
This means that there is a specific correspondence between certain kinds of predictive probabilities, as described by the Carnapian $\lambda$-rules, and certain kinds of Bayesian statistical inferences, namely with multinomial hypotheses and priors
from the family of Dirichlet distributions, which generalise the Beta-distributions used in the foregoing. On top of this, the equivalence between Carnapian inductive logic and Bayesian statistical inference is more general than is shown in the foregoing. Instead of the well-behaved priors just considered, we might consider any functional form as a prior over the hypotheses $h_\theta$, and then wonder what the resulting predictive probability is. As de Finetti showed in his representation theorem, the resulting predictive probability will always comply with a predictive property known as exchangeability, which was given in Equation (6). Conversely, and more surprisingly, any predictive probability complying with the property of exchangeability can be written down in terms of a Bayesian statistical inference with multinomial hypotheses and some prior over these hypotheses. In sum, de Finetti showed that there is a one-to-one correspondence between the predictive property of exchangeability on the one hand, and Bayesian statistical inferences using multinomial hypotheses on the other. It is insightful to make this result by de Finetti explicit in terms of the non-ampliative inductive logic discussed in the foregoing. Recall that a Bayesian statistical inference takes a prior and likelihoods as premises, leading to a single probability assignment over the space $H \times Q$ as the only assignment satisfying the premises. We infer probabilistic consequences, such as the posterior and the predictions, from this probability assignment. Similarly, a Carnapian inductive logic is characterised by a single probability assignment, defined over the space $Q$, from which the predictions can be derived. The representation theorem by de Finetti effectively shows an equivalence between these two probability assignments: when it comes to predictions, we can reduce the probability assignment over $H \times Q$ to an assignment over $Q$ only. For de Finetti, this equivalence was very welcome. He had a strictly subjectivist interpretation of probability, believing that probability expresses uncertain belief only. Moreover, he was eager to rid science of its metaphysical excess baggage to which, in his view, the notion of objective chance belonged. So de Finetti applied his representation theorem to argue against the use of multinomial hypotheses, and thereby against the use of statistical hypotheses more generally. Why refer to these obscure chances if we can achieve the very same statistical ends by employing the unproblematic notion of exchangeability? The latter is a predictive property, and it can hence be interpreted as an empirical and as a subjective notion. The fact is that statistics, as it is used in the sciences, is persistent in its use of statistical hypotheses. Therefore I want to invite the reader to consider the inverse application of de Finetti's theorem. Why does science use these obscure objective chances? As I argue extensively in [Romeijn, 2005], the reason is that statistical hypotheses provide invaluable help by, indirectly, pinning down the probability assignments over $Q$ that have the required predictive properties. Rather than reducing the Bayesian inferences over statistical hypotheses to inductive predictions over observations, we can use the representation theorem to capture relations between observations in an insightful way, namely by citing the statistical
hypotheses that may be true of the data. As further illustrated in [Romeijn, 2004; Romeijn, 2006], enriching inductive logic in this way improves the control that we have over predictive properties. Finally, it may be noted that this view on inductive logic is comparable to the “presupposition view” in [Festa, 1993], which takes a similar line with regard to the choice of $\lambda$ in Carnapian inductive logic. It is also strongly related to the views expressed by Hintikka in [Auxier and Hahn, 2006], and I want to highlight certain aspects of this latter view in particular. In response to Kuipers' overview of inductive logic, Hintikka writes that “Inductive inference, including rules of probabilistic induction, depends on tacit assumptions concerning the nature of the world. Once these assumptions are spelled out, inductive inference becomes in principle a species of deductive inference.” Now the symmetry principles and predictive properties used in Carnapian inductive logic are exactly the tacit assumptions Hintikka speaks about. As explained in the foregoing, the use of particular statistical hypotheses in a Bayesian inference comes down to the very same set of assumptions, but now these assumptions are not tacit anymore: they have been made explicit as the choice for a particular set of statistical hypotheses. Therefore, the use of statistical hypotheses that I have advertised above may help us to get closer to the ideal of inductive logic envisaged by Hintikka.
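Returning to the formal side of the correspondence, Equation (14) can also be checked numerically; the sketch below (an added illustration, assuming NumPy) integrates the Beta prior of Equation (10) against the multinomial likelihoods and compares the resulting predictive probability with the $\lambda$-rule:

```python
import numpy as np

lam, t1, t0 = 2.0, 2, 4                  # prior parameter and the counts from s_000101
t = t1 + t0

theta = np.linspace(1e-6, 1 - 1e-6, 200001)
prior = theta**(lam / 2 - 1) * (1 - theta)**(lam / 2 - 1)   # Equation (10)
posterior = prior * theta**t1 * (1 - theta)**t0             # Equation (11), unnormalised
posterior /= np.trapz(posterior, theta)

predictive = np.trapz(theta * posterior, theta)             # Equations (12) and (13)
print(predictive, (t1 + lam / 2) / (t + lam))               # both 0.375, as in Equation (14)
```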
9 NEYMAN-PEARSON TEST AS AN INFERENCE
In this final section, I investigate whether we can turn the Neyman-Pearson procedure of Section 4 into an inference within Bayesian inductive logic. This might come across as a pointless exercise in statistical yoga, trying to make Neyman and Pearson relax in a position that is far from natural. However, the exercise will nicely illustrate the use of Bayesian inductive logic. Moreover, I think that it will bring Neyman-Pearson testing and inductive logic closer together, and thereby stimulate research on the intersection of inductive logic and statistics in the sciences. An additional reason for investigating Neyman-Pearson hypothesis testing in this framework is that in many practical applications, scientists are tempted to read the probability statements about the hypotheses inversely: the significance is often taken as the probability that the null hypothesis is true. Although emphatically wrong, this inferential reading has a strong intuitive appeal to users. The following will make explicit that in this reading, the Neyman-Pearson procedure is effectively taken as a kind of non-ampliative inductive inference. First, we construct the space $H \times Q$, and define the probability functions $P_{h_j}$ over the sample spaces $\langle h_j, Q \rangle$. For the prior probability assignment over the two hypotheses, we take $P(h_0) \in (l, u)$, meaning that $l < P(h_0) < u$. Finally, we adopt the restriction that $P(h_0) + P(h_1) = 1$. This defines a set of probability functions over the entire space, serving as a starting point of the inference. Next we include the data in the probability assignments. Crucially, we coarse-
grain the observations to the simple observation $f^j$, with $f^j = \{s_t : F(s_t) = j\}$, so that the observation simply encodes the value of the test function. It follows from this coarse-graining that we obtain the type-I and type-II errors as the likelihoods of the observations,
$$P(f^1 \mid h_0) = \alpha, \qquad P(f^0 \mid h_1) = \beta.$$
Finally we use Bayes' theorem to derive a set of posterior probability distributions over the hypotheses, according to
$$\frac{P(h_1 \mid f^j)}{P(h_0 \mid f^j)} = \frac{P(f^j \mid h_1)\, P(h_1)}{P(f^j \mid h_0)\, P(h_0)}.$$
Note that the quality of the test, in terms of size and power, will be reflected in the posteriors. If, for example, we find an observation $s_t$ that allows us to reject the null hypothesis, so $f^1$, then as long as $\alpha < 1 - \beta$, meaning that the significance is smaller than the power, we find that $P(h_0 \mid f^1) < P(h_0)$ and $P(h_1 \mid f^1) > P(h_1)$. The larger the difference between significance and power, the larger the difference between posteriors and priors. Note, however, that we have not yet decided on a fully specified prior probability over the statistical hypotheses. This echoes the fact that classical statistics does not make use of a prior probability. However, it is only by restricting the prior probability over hypotheses in some way or other that we can make the Bayesian rendering of the results of Neyman and Pearson work. In particular, if we choose $(l, u) = (0, 1)$ for the prior, then we find $(l', u') = (0, 1)$ for the posterior as well. However, if we choose
$$l \ge \frac{\beta}{\beta + 1 - \alpha}, \qquad u \le \frac{1 - \beta}{1 - \beta + \alpha},$$
we find for all P(h_0) ∈ (l, u) that P(h_0|f^1) < 1/2 < P(h_1|f^1). Similarly, we find P(h_0|f^0) > 1/2 > P(h_1|f^0). So with this interval prior, an observation s_t for which F(s_t) = 1 tilts the balance towards h_1 for all the probability functions P in the interval, and vice versa. Let me illustrate the Bayesian inference by means of the above example on pears. We set up the sample space and hypotheses as before, and we then coarse-grain the observations to f^j, corresponding to the value of the test function: f^1 = q^0 ∪ q^1 and f^0 = q^2. We obtain

P(f^1|h_0) = P(q^0 ∪ q^1|h_0) = α = 0.05,
P(f^0|h_1) = P(q^2|h_1) = β = 0.30.
Choosing P(h_0) ∈ (0.24, 0.93), this results in P(h_0|f^0) ∈ (0.50, 0.98) and P(h_0|f^1) ∈ (0.02, 0.50). Depending on the choice of prior, one can argue that the resulting Bayesian inference replicates the Neyman-Pearson procedure: if the probability over hypotheses expresses our preference over them, then indeed f^0 makes us prefer h_0 and f^1 makes us prefer h_1. Importantly, the inference fits the entailment relation mentioned earlier: we have a set of probabilistic models on the side of the premises, namely the set of priors over H, coupled to the full probability assignments over ⟨h_j, Q⟩ for j = 0, 1. And we have a set of models on the conclusion side, namely the set of posteriors over H. Because the latter is computed from the former by the axioms of probability, the two sets include the same probability functions. Therefore the conclusion is classically entailed by the premises.

The above example shows that we can imitate the workings of a Neyman-Pearson test in Bayesian inductive logic, and thus in terms of a non-ampliative inductive inference. But the imitation is far from perfect. For one, the result of a Bayesian inference will always be a probability function. By contrast, Neyman-Pearson statistics ends in a decision to accept or reject, which is a binary decision instead of some sort of weak or inconclusive preference. Of course, there are many attempts to weld a binary decision onto the probabilistic end result of a Bayesian inference, for example in [Levi, 1980] and in the discussion on rational acceptance, e.g., [Douven, 2002]. In particular, we might supplement the probabilistic results of a Bayesian inference with rules for translating the probability assignments into decisions, e.g., we choose h_0 if we have P(h_0|s_t) > 1/2, and similarly for h_1. However, the bivalence of Neyman-Pearson statistics cannot be replicated in a Bayesian inference itself. It will have to result from a decision-theoretic add-on to the inferential part of Bayesian statistics. More generally, the representation in probabilistic logic will probably not appeal to advocates of classical statistics. Quite apart from the issue of binary acceptance, the whole idea of assuming a prior probability, however unspecific, may be objected to on the principled ground that probability functions express long-term frequencies, and that hypotheses cannot have such frequencies.

There is one attractive feature, at least to my mind, of the above rendering, that may be of interest in its own right. With the representation in place, we can ask again how to understand the example by Jeffreys, as considered in Section 4. Following [Edwards, 1972], it illustrates that Neyman and Pearson tests do not respect the likelihood principle, because they depend on the probability assignment over the entire sample space and not just on the probability of the observed sample. However, in the Bayesian representation we do respect the likelihood principle, but in addition we condition on f^j, not on q^k. In fact the whole example hinges on how the samples are grouped into regions of acceptance and rejection. Instead of adopting the diagnosis by Hacking concerning the likelihood principle, we could therefore say that the approach of Neyman and Pearson takes the observations in terms of a rather coarse-grained partition of information. In other words, rather than saying that Neyman-Pearson procedures violate the likelihood principle, we can also say that the procedures crucially depend on how the observed sample is framed, and thus violate the principle of total evidence.
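To make the posterior computation concrete, the following minimal Python sketch (added here; it is not part of the original text, and simply uses the numbers of the pears example) maps the interval prior through Bayes' theorem.

    alpha, beta = 0.05, 0.30   # significance and type-II error of the test

    def posterior_h0(prior_h0, reject):
        """Posterior of h0 after observing f1 (rejection) or f0 (no rejection)."""
        like_h0 = alpha if reject else 1 - alpha
        like_h1 = (1 - beta) if reject else beta
        joint_h0 = like_h0 * prior_h0
        return joint_h0 / (joint_h0 + like_h1 * (1 - prior_h0))

    # bounds of the interval prior, as given in the text
    l = beta / (beta + 1 - alpha)          # 0.24
    u = (1 - beta) / (1 - beta + alpha)    # 0.933...

    for p in (l, u):
        print(posterior_h0(p, reject=False), posterior_h0(p, reject=True))
    # f0 maps the prior interval (0.24, 0.93) to roughly (0.50, 0.98),
    # and f1 maps it to roughly (0.02, 0.50), matching the example.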
10
IN CONCLUSION
In the foregoing I have discussed three statistical procedures, to wit, Neyman-Pearson hypothesis testing, Fisher's maximum likelihood estimation, and Bayesian statistical inference. These three procedures were seen to relate to inductive logic in a variety of ways. The two classical approaches were connected most naturally to ampliative inductive inference, running from a set of probability functions and the data to a restricted set of such functions. However, I have also related both procedures to non-ampliative inferences. First I connected parameter estimation to Carnapian inductive logic. Then I related this logic to Bayesian statistical inference, which was seen to be non-ampliative already. Further, I have indicated how Carnapian inductive logic can be extended to Bayesian inductive logic, which accommodates the use of statistical hypotheses and thus captures Bayesian statistics. Finally, I have illustrated the latter logic by giving a non-ampliative account of Neyman-Pearson hypothesis testing. I hope that portraying statistical procedures in the setting of inductive logic has been illuminating. In particular, I hope that the relation between Carnapian inductive logic and Bayesian statistics stimulates research on the intersection of the two. Certainly, some research in this area has already been conducted; see for example [Skyrms, 1991; Skyrms, 1993; Skyrms, 1996] and [Festa, 1993]. Following these contributions, [Romeijn, 2005] argues that an inductive logic that includes statistical hypotheses in its language is closely related to Bayesian statistical inference, and some of these views have been reiterated in this chapter. However, I believe that there is much room for improvement. Research on the intersection of inductive logic and statistical inference can certainly enhance the relevance of inductive logical systems to scientific method and the philosophy of science. In parallel, I believe that insights from inductive logic may help to clarify the foundations of statistics.
ACKNOWLEDGEMENTS

I want to thank Roberto Festa for comments on an earlier draft of this chapter, and Jos Uffink for drawing my attention to Jeffreys' amusing quote on the "remarkable procedure". I also thank the Spanish Ministry of Science and Innovation (Research project FFI2008-1169) for financial support. This research was carried out as part of a project funded by the Dutch Organization of Scientific Research (NWO VENI grant nr. 275-20-013).
BIBLIOGRAPHY

[Auxier and Hahn, 2006] R. E. Auxier and L. E. Hahn, editors. The Philosophy of Jaakko Hintikka. Open Court, Chicago, 2006.
[Barnett, 1999] V. Barnett. Comparative Statistical Inference. John Wiley, New York, 1999.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability. University of Chicago Press, Chicago, 1950.
[Carnap, 1952] R. Carnap. The Continuum of Inductive Methods. University of Chicago Press, Chicago, 1952.
[Dawid and Stone, 1982] A. P. Dawid and M. Stone. The functional-model basis of fiducial inference (with discussion). Annals of Statistics, 10(4):1054–1074, 1982.
[de Finetti, 1937] B. de Finetti. La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré, 7(1):1–68, 1937.
[Douven, 2002] I. Douven. A new solution to the paradoxes of rational acceptability. The British Journal for the Philosophy of Science, 53:391–410, 2002.
[Edwards, 1972] A. W. F. Edwards. Likelihood. Cambridge University Press, Cambridge, 1972.
[Festa, 1993] R. Festa. Optimum Inductive Methods. Kluwer, Dordrecht, 1993.
[Festa, 1996] R. Festa. Analogy and exchangeability in predictive inferences. Erkenntnis, 45:89–112, 1996.
[Fisher, 1930] R. A. Fisher. Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535, 1930.
[Fisher, 1935] R. A. Fisher. The fiducial argument in statistical inference. Annals of Eugenics, 6:317–324, 1935.
[Fisher, 1956] R. A. Fisher. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, 1956.
[Hacking, 1965] I. Hacking. The Logic of Statistical Inference. Cambridge University Press, Cambridge, 1965.
[Haenni et al., 2009] R. Haenni, J.-W. Romeijn, G. Wheeler, and J. Williamson. Probabilistic Logics and Probabilistic Networks. Springer, 2009.
[Hailperin, 1996] T. Hailperin. Sentential Probability Logic. Lehigh University Press, 1996.
[Hartmann et al., 2009] S. Hartmann, D. Gabbay, and J. Woods, editors. Handbook of the History of Logic: Inductive Logic (Volume 10). College Publications, 2009.
[Howson, 2003] C. Howson. Probability and logic. Journal of Applied Logic, 1(3–4):151–165, 2003.
[Jeffreys, 1931] H. Jeffreys. Scientific Inference. Cambridge University Press, Cambridge, 1931.
[Johnson, 1932] W. Johnson. Probability: the deductive and inductive problems. Mind, 49:409–423, 1932.
[Kyburg, 1974] H. E. Kyburg, Jr. The Logical Foundations of Statistical Inference. D. Reidel, Dordrecht, 1974.
[Levi, 1980] I. Levi. The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance. MIT Press, Cambridge, MA, 1980.
[Maher, 2000] P. Maher. Probabilities for two properties. Erkenntnis, 52:63–81, 2000.
[Neyman and Pearson, 1967] J. Neyman and E. Pearson. Joint Statistical Papers. University of California Press, Berkeley, 1967.
[Paris and Waterhouse, 2008] J. Paris and P. Waterhouse. Atom exchangeability and instantial relevance. Unpublished manuscript, 2008.
[Press, 2003] J. Press. Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley, New York, 2003.
[Rényi, 1970] A. Rényi. Probability Theory. North Holland, Amsterdam, 1970.
[Romeijn, 2004] J.-W. Romeijn. Hypotheses and inductive predictions. Synthese, 141(3):333–364, 2004.
[Romeijn, 2005] J.-W. Romeijn. Bayesian Inductive Logic. PhD dissertation, University of Groningen, 2005.
[Romeijn, 2006] J.-W. Romeijn. Analogical predictions for explicit similarity. Erkenntnis, 64:253–280, 2006.
[Seidenfeld, 1979] T. Seidenfeld. Philosophical Problems of Statistical Inference: Learning from R. A. Fisher. Reidel, Dordrecht, 1979.
[Skyrms, 1991] B. Skyrms. Carnapian inductive logic for Markov chains. Erkenntnis, 35:35–53, 1991.
[Skyrms, 1993] B. Skyrms. Analogy by similarity in hyper-Carnapian inductive logic. In J. Earman, A. I. Janis, G. Massey, and N. Rescher, editors, Philosophical Problems of the Internal and External Worlds, pages 273–282. University of Pittsburgh Press, Pittsburgh, 1993.
[Skyrms, 1996] B. Skyrms. Carnapian inductive logic and Bayesian statistics. In Statistics, Probability, and Game Theory, pages 321–336. IMS Lecture Notes, 1996.
[Wheeler, 2006] G. Wheeler. Rational acceptance and conjunctive/disjunctive absorption. Journal of Logic, Language and Information, 15(1–2):49–63, 2006.
[Zabell, 1982] S. Zabell. W. E. Johnson's "sufficientness" postulate. Annals of Statistics, 10:1091–1099, 1982.
Part IX
Various Issues about Causal Inference
COMMON CAUSE IN CAUSAL INFERENCE
Peter Spirtes
1
INTRODUCTION
One of the major impediments to reliably inferring both qualitative and quantitative causal relations from non-experimental data is the possibility that there may be unobserved common causes of observed variables. Suppose, for example, the correlation between two observed variables is measured, e.g. Barometer Reading and Rainfall, and that there is no domain background knowledge about the causal relations between these variables. A qualitative causal question is of the form "Will manipulating the barometer reading (e.g. by rotating a dial showing the barometer reading) affect subsequent rainfall?" A quantitative causal question is of the form "How much does manipulating the barometer reading to m affect subsequent rainfall?"[1] The possibility of unobserved common causes presents no problems for causal inference when simple randomized experiments are possible, beyond the ordinary statistical problems of inferring a population distribution from a sample.[2] For example, if it is possible to randomly select a number and set the value of the barometer reading according to the outcome of the randomizing device, it is clearly possible to answer both the qualitative and quantitative causal questions simply by observing the subsequent rainfall, given a large enough sample size. However, in many instances randomized experiments cannot be performed due to ethical considerations, or practical considerations such as the amount of time and money required to perform experiments. This article will consider cases where randomized experiments are not possible.

[1] This example will be revisited in section 3.2.3 in more detail. Although qualitative features of the causal relationships in this example are widely known, it is highly idealized in a number of ways, including not being specific about precisely how the variables are measured. As in many social science models, the example does not specify time indices for the variables, which raises a number of interesting questions about the interpretations of these models [Fisher, 1970]. However, these issues are not particular problems for models with unobserved common causes and will not be discussed in detail here.
[2] On the other hand, in sequential randomized trials with unobserved common causes, the estimation of causal effects can be quite intricate [Robins, 1986].

For example, the direct effect of Barometer Reading on Rainfall relative to the set of variables S = {Rainfall, Atmospheric Pressure, Barometer Reading} can ideally be experimentally determined by performing the following two manipulations of variables in S. In both manipulations Atmospheric Pressure is set to the same constant c.
In the first manipulation Barometer Reading is set to the constant d, and in the second manipulation, Barometer Reading is set to d + 1. The average difference in Rainfall resulting from these two different manipulations is the direct effect of Barometer Reading on Rainfall relative to the set of variables S.

Suppose however that the experiments cannot be performed. Causal modelers in a variety of social sciences have an easy way to answer quantitative causal questions, given (often unrealistically) strong background assumptions. For example, suppose it is assumed that the time order of events is known, that the causal relationships are linear, and that there are no unobserved common causes of any pair of variables in S. The direct effect of Barometer Reading on Rainfall relative to S is equal to the regression coefficient relating Barometer Reading to Rainfall when Rainfall is regressed on both Barometer Reading and Atmospheric Pressure (all of the variables in S that temporally precede Rainfall) in the unmanipulated population (i.e. the population where the barometer dial is not manually set). The coefficient of Barometer Reading in this regression will be zero, which is the correct answer about the direct effect of Barometer Reading on Rainfall relative to S. (See section 3.2.3 for more details on regression.)

But, without the assumption that there are no unmeasured common causes, the regression method is not reliable as a method for calculating the direct effect. For example, if Atmospheric Pressure is not observed, and S′ is the set of observed variables {Rainfall, Barometer Reading}, then Atmospheric Pressure is an unmeasured common cause of Barometer Reading and Rainfall. The direct effect of Barometer Reading on Rainfall relative to S′ is still zero. Regressing Rainfall on all of the observed variables (i.e. variables in S′) that precede it in time results in a regression of Rainfall on Barometer Reading. Because of the non-zero correlation between Barometer Reading and Rainfall, the regression coefficient when Rainfall is regressed on Barometer Reading is also non-zero. In this case, the regression coefficient is not equal to the direct effect of Barometer Reading on Rainfall relative to S′.

One partial solution to this problem that is commonly employed in many social sciences is to attempt to observe as many potential common causes (i.e. variables occurring prior to the effect) as possible. However, in many cases there is no reliable way to tell if all of the common causes have been observed. Without background knowledge, there is no way to tell whether the regression coefficient of Barometer Reading when Rainfall is regressed on Barometer Reading is non-zero because Barometer Reading causes Rainfall, or because it does not cause Rainfall but some unobserved third variable, e.g. Atmospheric Pressure, causes both Rainfall and Barometer Reading. Even worse, it is not generally recognized that including more potential common causes in a regression can make the difference between the regression coefficient and the direct effect being estimated larger [Spirtes et al., 1998]. See section 3.2.3 for more details and examples. There is no reliable method of causal inference of the effect of manipulating Barometer Reading on Rainfall from the correlation between Barometer Reading and Rainfall alone. This is the truth in the maxim "Correlation is not causation."
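As a minimal illustration of this point (a Python sketch added here, not part of the original chapter; all coefficients are hypothetical), the following generates data in which an unobserved Atmospheric Pressure drives both variables, while Barometer Reading has no effect on Rainfall; regressing Rainfall on Barometer Reading alone nevertheless yields a non-zero coefficient.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # hypothetical linear model: pressure causes both observed variables,
    # and the barometer has no effect on rainfall (true direct effect = 0)
    pressure = rng.normal(0.0, 1.0, n)              # unobserved common cause
    barometer = 0.9 * pressure + rng.normal(0.0, 0.4, n)
    rainfall = 0.8 * pressure + rng.normal(0.0, 0.6, n)

    # regression of Rainfall on Barometer Reading alone: spuriously non-zero
    print(np.cov(rainfall, barometer)[0, 1] / np.var(barometer))   # ~0.74

    # including the (here, actually unobservable) common cause recovers 0
    A = np.column_stack([barometer, pressure])
    print(np.linalg.lstsq(A, rainfall, rcond=None)[0])   # ~[0.0, 0.8]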
However, with some additional assumptions, in many cases reliable causal inference is possible even when there may be unrecorded causes unknown prior to the analysis of the data, and time order is not known. The following sections describe a variety of assumptions and algorithms for reliable causal inference, even when there may be unobserved common causes.

There are many types of statistical/causal models that postulate unobserved common causes. These include principal components models, factor analytic models (see section 4.2.4), item response models, some structural equation models, Rasch models, and finite mixture models (see e.g. [Bartholomew & Knott, 1999; Harman, 1976]). They differ in the families of distributions they represent, and the kinds of constraints they entail on marginal distributions over observed variables (the constraints that help to make possible inferences to the existence and causal roles of unobserved variables). It should be noted that the statistical terminology is ambiguous. For example, "factor analysis" denotes a family of methods (of dubious reliability) for generating models with latent variables; a "factor analysis" model is any model that might be produced by such a procedure. The very same model, if produced by hand, might be termed a "multiple indicator model". One and the same statistical model might, or might not, be given a causal interpretation. This article will not describe all of these models in detail, but will use structural equation models (SEMs, section 2) as representative of a large category of causal models with unobserved common causes (including principal components and factor analysis models as a special case).[3]

At least in publications, common methods of causal inference in the social sciences either assume at the outset that an appropriate causal explanation will involve unobserved common causes, or assume that any common causes are themselves among the measured variables. Such procedures beg important questions: are there unobserved common causes at work? If so, how can they be found, and how can their causal relations be found? What causal relations among the measured variables can be found in spite of unobserved common causes? These questions are the focus of this survey.

Section 2 describes structural equation models. Section 3 introduces the concept of a manipulation, in order to clearly distinguish several different kinds of causal inference from statistical inference. Section 4 discusses problems for reliable inference of qualitative causal relations from observational data and background assumptions, and some approaches that have been taken to solving these problems. Section 5 describes some important open problems. Section 6 is an appendix that defines some of the technical terms used in the text.
[3] In the epidemiological literature, the problem of inferring the effects of manipulations is described as the problem of inferring potential outcomes. See [Rubin, 1974].
2
STRUCTURAL EQUATION MODELS
The set of variables in a structural equation model (SEM) can be divided into two subsets, the "error variables" or "error terms," and the substantive variables (for which there is no standard terminology in the literature). The substantive variables are the variables of interest, but they are not necessarily all observed. In SEM C, shown in Figure 1, the substantive variables are the variables {X, Y, Z, W}. (The various parts of Figure 1 are explained in more detail below.) For example, in a simplified model of money supply in equilibrium [Wyatt, 2004], Y could represent investment, Z could represent money supply, X could represent GDP, and W could represent monetary base. Typically, for each substantive variable X there is a corresponding error term for X that represents all of the causes of X that are not substantive variables.[4] Each substantive variable such as X occurs on the left hand side of one equation that relates the value of X to the direct causes of X plus the error term ε_X on the right hand side of the equation [Bollen, 1989].

Causal Graph:
X → Y ← Z ← W

Structural Equations:
X := ε_X
Y := b_YX · X + b_YZ · Z + ε_Y
Z := b_ZW · W + ε_Z
W := ε_W

Examples of Implied Covariances:
Cov_C(θ)(Y, W) = b_YZ · b_ZW
Cov_C(θ)(Z, W) = b_ZW

Examples of Total Effects:
Total effect of W on Y = b_YZ · b_ZW = Cov_C(θ)(Y, W)
Total effect of Y on W = 0 ≠ Cov_C(θ)(Y, W)

Figure 1. SEM C

For purposes of exposition, the error terms will be assumed to be Gaussian and the structural equations will be assumed to be linear, unless explicitly stated otherwise. When the error terms are Gaussian, and the structural equations are linear, all of the substantive variables will also be Gaussian. Without loss of generality, the substantive variables can be transformed to have distribution N(0, 1) (where N indicates the distribution is Gaussian, the first number represents the mean, and the second number represents the standard deviation) and the error terms can be transformed to have zero mean. For example, in SEM C the structural equation for Y is "Y := b_YX · X + b_YZ · Z + ε_Y", where b_YX[5] and b_YZ are linear structural coefficients, and ε_Y is the error term for Y.

[4] Error terms under a causal interpretation should be distinguished from the residuals from a regression (i.e. the difference between the predicted value from a regression and the actual value of a variable), although under certain conditions they are equal. See section 3.2.3.
[5] In general, if Y has a non-zero coefficient in the structural equation for X, it is denoted by b_XY.
Using Lauritzen's notation [Lauritzen, 2001], the structural equation uses an assignment operator, ":=", to relate the left hand side and the right hand side. This indicates that the variables on the right hand side are to be interpreted as causes of the left hand side, and that the equation can be used not only to describe the relationships between variables in the existing population, but also to predict the effects of manipulating some of the variables (see section 3 for details).

SEMs have two forms: a free parameter form and a fixed parameter form. In the free parameter form, the linear coefficients in the structural equations (e.g. b_YZ) and the covariance matrix of the error terms (e.g. var(ε_Y)) are variables. In the fixed parameter form, the linear coefficients in the structural equations (e.g. b_YZ) and the covariance matrix among the error terms (e.g. var(ε_Y)) are constants. The context will make it clear whether b_YZ refers to a variable (in a SEM with free parameters) or a constant (in a SEM with fixed parameters). If SEM C has free parameters, C(θ) represents the fixed parameter SEM where the free parameters have been assigned fixed values according to the assignment θ (e.g., θ = {b_YZ = b_YX = b_ZW = 0.4, var(ε_X) = var(ε_W) = 1, var(ε_Y) = 0.68, var(ε_Z) = 0.84}, where in this example the error terms are assumed to be uncorrelated).
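As a minimal check (a sketch added here, not part of the original text), the structural equations of SEM C with the fixed parameters θ above can be simulated directly; the sample moments approximate the entailed unit variances and the covariances listed in Figure 1.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    b_YX = b_YZ = b_ZW = 0.4   # the values assigned by theta above

    # draw the error terms with the variances given by theta, then apply
    # the structural equations of SEM C in causal order
    X = rng.normal(0.0, 1.0, n)                        # X := eps_X
    W = rng.normal(0.0, 1.0, n)                        # W := eps_W
    Z = b_ZW * W + rng.normal(0.0, np.sqrt(0.84), n)   # Z := b_ZW*W + eps_Z
    Y = b_YX * X + b_YZ * Z + rng.normal(0.0, np.sqrt(0.68), n)

    print(np.var(Y), np.var(Z))     # both ~1: theta was chosen to give unit variances
    print(np.cov(Y, W)[0, 1])       # ~ b_YZ * b_ZW = 0.16
    print(np.cov(Z, W)[0, 1])       # ~ b_ZW = 0.40
    print(np.cov(X, W)[0, 1])       # ~ 0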
[Figure 2, garbled in extraction, shows the graphs of five alternatives to SEM C, among them SEM A, SEM B (X → Y ← Z ← T → W), SEM B′ (the graph over {X, Y, Z, W} with correlated errors ε_Z ↔ ε_W), and SEM D.]

Figure 2. Alternatives to SEM C

For example, in SEM B of Figure 2, if the set of substantive variables is {X, Y, Z, T, W} then the causal graph is X → Y ← Z ← T → W. However, if the substantive variables are {X, Y, Z, W} the corresponding causal graph is shown in B′ of Figure 2. If T is left out of the substantive variables, then the "other non-substantive causes" of Z and W both contain T, which correlates the error terms of Z and W. The ε_Z ↔ ε_W edge is typically taken to imply that there is some unobserved common cause of Z and W, but it is less specific than the Z ← T → W representation of a common cause (because the former does not specify how many common causes there are or their relationships to each other, while the latter does).
Let X be the set of all substantive variables in SEM C.[6] The structural equations together with the distribution of the error terms in C(θ) entail a probability distribution f_C(θ)(X) over the substantive variables X. If the joint distribution is multivariate Gaussian, the distribution f_C(θ)(X) is completely characterized by the covariance matrix Σ_C(θ) over the substantive variables, and the means µ of the substantive variables. f_C(θ)(X) ∼ N(Σ_C(θ), µ) is a function of the fixed parameter values, e.g. cov_C(θ)(Y, W) = b_YZ · b_ZW.[7] If, for a given covariance matrix Σ, there exists an assignment of parameter values θ of SEM C so that the entailed covariance matrix is Σ (i.e. Σ_C(θ) = Σ), say that SEM C represents Σ. If, for each Σ that can be represented by SEM C, there is a unique assignment of values to the free parameters θ such that Σ_C(θ) = Σ, then the free parameters of SEM C are identifiable. If the free parameters of SEM C are identifiable, then given a covariance matrix Σ that is represented by SEM C, the values of the free parameters are uniquely determined by Σ.

For purposes of illustration, it will be assumed that the graphs are acyclic unless explicitly stated otherwise. The assumptions of Gaussian error terms, linear structural equations, and acyclic graphs simplify the examples while still illustrating the basic problems for causal inference and the basic strategies for solving the problems. Many of the algorithms for causal inference that succeed on this case can be extended to work on more general cases (which will be pointed out in the corresponding discussion).

The following rule explains how to use the graph of a SEM C to calculate Cov_C(θ)(Y, W). The rule for calculating covariances from the graph in this way will be helpful in explaining why causal inference with unobserved common causes introduces problems that do not occur when there are no unobserved common causes. A path between X and W is a sequence of adjacent edges starting with X and ending with W, or starting with W and ending with X, where each edge occurs at most once. (A path between X and W is also a path between W and X.) For example, X → Y ← Z ← W is a path in the graph of SEM C between X and W, and also a path between W and X. A directed path from W to Y is a special kind of path in which all of the edges on the path point towards Y. Y ← Z ← W is a directed path from W to Y, but not a directed path from Y to W (because the arrows point towards Y, not W). X → Y ← Z ← W is not a directed path from W to X because not all of the arrows point towards X; it is also not a directed path from X to W, because not all of the arrows point towards W. A trek between Y and W is either a directed path from Y to W, or a directed path from W to Y, or a pair of directed paths from some variable Z to W and Y that intersect only at Z. (A trek between Y and W is also a trek between W and Y.)

[6] Individual variables are in italics, and sets of variables are in boldface.
[7] In matrix form, the structural equations are X = BX + ε, where ε is the set of all error terms, and B is the structural coefficient matrix. If the covariance matrix among the ε is Σ_ε, then Σ_S(θ) = (I − B)^{-1} Σ_ε ((I − B)^{-1})^T, where I is the identity matrix, (I − B)^{-1} is the inverse of (I − B), and ((I − B)^{-1})^T is the transpose of the inverse of (I − B).
There are no treks between X and W in SEM C because the only path between X and W is not a directed path from X to W, or a directed path from W to X, and there is no pair of directed paths to X and W from some third variable. However, there is a trek between Y and W, namely Y ← Z ← W (which is also a trek between W and Y). A trek product is the product of the linear structural coefficients associated with each edge on the trek.[8] For example, the trek product of Y ← Z ← W is b_YZ · b_ZW. It can be shown that Cov_C(θ)(Y, W) is the trek sum, i.e. the sum of all trek products between Y and W [Spirtes et al., 2001]. SEM C has only one trek between Y and W, so cov_C(θ)(Y, W) = b_YZ · b_ZW. cov_C(θ)(X, W) = 0 because there is no trek between X and W in SEM C.

[8] This assumes that the variance of each substantive variable is equal to 1. Otherwise the product of structural coefficients on the trek should also be multiplied by the variance of the unique vertex on the trek that has no edges pointing into it (the source of the trek).
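The trek rule lends itself to a small computation. The following sketch (added here; not from the original text) enumerates treks in the graph of SEM C by pairing directed paths that share only their source, and sums the trek products, assuming unit variances for the source variables as in footnote [8].

    # edges of SEM C with their structural coefficients (values from theta)
    COEF = {("X", "Y"): 0.4, ("Z", "Y"): 0.4, ("W", "Z"): 0.4}
    NODES = ["X", "Y", "Z", "W"]

    def directed_paths(src, dst):
        """All directed paths from src to dst, as lists of edges."""
        if src == dst:
            return [[]]
        paths = []
        for (a, b) in COEF:
            if a == src:
                paths += [[(a, b)] + rest for rest in directed_paths(b, dst)]
        return paths

    def trek_sum(u, v):
        """Sum over treks between u and v of the trek products."""
        total = 0.0
        for source in NODES:
            for p1 in directed_paths(source, u):
                for p2 in directed_paths(source, v):
                    heads1 = {b for (_, b) in p1}
                    heads2 = {b for (_, b) in p2}
                    if heads1 & heads2:   # the paths may share only the source
                        continue
                    prod = 1.0            # unit source variance assumed
                    for edge in p1 + p2:
                        prod *= COEF[edge]
                    total += prod
        return total

    print(trek_sum("Y", "W"))   # b_YZ * b_ZW = 0.16 (one trek: Y <- Z <- W)
    print(trek_sum("X", "W"))   # 0.0: no trek between X and W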
3
CONDITIONING VERSUS MANIPULATING
The fundamental difference between a statistical but non-causal model and a statistical/causal model is that the latter can be used to both represent a family of probability distributions and calculate the effects of manipulating variables (roughly performing randomized experiments on the variables), while the former can be used to represent a family of distributions but cannot be used to calculate the effects of manipulating variables. This section elaborates on this distinction.
3.1
Conditioning
The probability density function[9] of Y conditional on W = m (denoted f(Y|W = m)) represents the density of Y in the subpopulation where W = m, and is defined from the joint distribution as f(Y|W = m) = f(Y, W)/f(W = m) (where f(Y, W) is the joint density of Y and W, and f(W = m) ≠ 0). The conditional density depends only upon the joint density (assuming the values of the variables conditioned on do not have density 0), and does not depend upon the causal relationships. When a variable W is conditioned on, this intuitively represents seeing the value of W. For simplicity, instead of using joint densities to illustrate various concepts about conditioning and manipulating, the means of variables will be used. The mean of Y conditional on W = m will be denoted E(Y|W = m). In SEM C(θ), it can easily be shown that E_C(θ)(Y|W = m) = m · cov_C(θ)(Y, W) = m · b_YZ · b_ZW (where b_YZ and b_ZW are the structural coefficients in C(θ)).

[9] A probability density function represents a probability distribution for continuous variables in terms of integrals. In the case of Gaussian variables, it is completely specified by the covariance matrix and the means of the variables.

Conditional densities and conditional probabilities are useful for problems (e.g. diagnosis) in which the value of the variable of interest is expensive or difficult
to measure, but other related variables that have information about the variable of interest are more easily observed. For example, the fact that the probability of measles conditional on spots is much higher than the probability of measles conditional on no spots is useful information in diagnosing whether someone has measles, since directly measuring the presence of measles is much more difficult and expensive than observing that they have spots.
3.2
Manipulating
In contrast to conditioning, a manipulated probability density is not a density in a subpopulation of the population, but is a density in a (possibly) hypothetical population formed by manipulating the value of a variable or variables in a causal system. For example, in an experiment in which W is randomized (which might be impossible to do in practice), whatever causes of W existed in the existing population are replaced by the outcome of a randomization device as the sole cause of W. In contrast to conditioning, which corresponds to seeing, manipulating corresponds to doing [Pearl, 2000]. Intuitively, the density of rainfall after seeing that the barometer reading is m (conditioning on the barometer reading) is different from the density of rainfall after manipulating the barometer reading to m. (The assumption is made that the manipulation is ideal. The only direct effect of an ideal manipulation is on the variable being manipulated; any other effects are due to the change in the variable being manipulated.)

If there is a structural equation model that correctly represents the causal structure of the general population, then manipulating the value of a variable W is represented by a new structural equation model. The new structural equation model replaces the old structural equation for W by a new structural equation that relates W to the outcome of the randomizing device; in addition, all of the other structural equations are left the same [Strotz and Wold, 1960; Spirtes et al., 2001; Pearl, 2000]. For example, if SEM C is true, then the effect of randomizing Y to some new distribution can be modeled by replacing "Y := b_YX · X + b_YZ · Z + ε_Y" with "Y := ε′_Y", where ε′_Y represents the output of the randomizing device. When Y is manipulated this way, a new model C(θ′) is created out of C(θ), where θ′ = θ, except that in θ′, var(ε_Y) is replaced by var(ε′_Y), and b_YZ and b_YX are set to 0. C(θ′) has a new set of structural equations, a new entailed distribution over the substantive variables, a new entailed covariance matrix, and a new causal graph that represents the new structural equations. The causal graph for C(θ′) is the same as the original causal graph for C(θ), except that all of the edges coming into the manipulated variable Y are removed because those variables have coefficients fixed at zero in the new set of structural equations. (In addition to the kinds of manipulations discussed here, there are more general kinds of manipulations. See e.g. [Pearl, 2000; Spirtes et al., 2001].)

Since by design the outcome of the randomizing device is not caused by any of
the other variables in the system, and is a direct cause only of Y , the randomizing device can be treated as an exogenous error term, and Y is equal to the error term. In the degenerate case, a manipulation can set every member of a population to have the same level of Y , in which case the structural equation that describes Y in the experimental population sets the level of Y to a constant (i.e. it sets the variance of the error term to zero.) The result of modifying the set of structural equations in this way can lead to a density in the randomized population that is not necessarily the same as the density in any subpopulation of the population. (For more details see [Spirtes et al., 2001; Pearl, 2000]). The direct effect of Y on X relative to the set of variables S represents the following experimental quantity. Perform two separate manipulations of all of the variables in S except for X. In both manipulations all of the variables in S except for X and Y are set to the same constant c. In the first manipulation Y is set to the constant d, and in the second manipulation, to d + 1. The average difference in X resulting from these two different manipulations is the direct effect of Y on X relative to the set of variables S. The direct effect of Y on X relative to S = {X, Y } (that is only Y is manipulated) is the total effect of Y on X. If the direct effect of Y on X relative to S is non-zero, then Y is a direct cause of X relative to S. If SEM C is true, W is not a direct cause of Y relative to {X, Y, Z, W }, because two manipulations that disagree on the value assigned to W but agree on the values assigned to X and Z produce the same average value of Y . On the other hand, relative to {X, Y, W }, W is a direct cause of Y , because two manipulations that disagree on the values assigned to W but agree on the values assigned to X, produce different average values of Y . (For a discussion of total and direct effects see [Bollen, 1989]). If the total effect of Y on X is non-zero, then Y is a cause of X.10 A set S of variables is causally sufficient if every variable that is a direct cause (relative to S) of any pair of variables in S is also in S. Intuitively, a set of variables is causally sufficient if no common causes of pairs of variables in the set have been left out of the set. If SEM C is true, for S = {X, Y, Z, W }, S and every subset of S is causally sufficient. In contrast, suppose SEM B (shown in Figure 2) with causal graph X → Y ← Z ← T → W is true. In that case S is not causally sufficient because T is a direct cause of W and Z relative to S, and both W and Z are in S. As pointed out in section 2, given the non-causally sufficient set of variables S, a SEM such as B ′ of Figure 2 that correctly represents both the population covariance matrix and the causal relations requires the introduction of a correlated error between εZ and εW .
[10] This terminology has the counter-intuitive consequence that Y can fail to be a cause of X, but can still be a direct cause of X relative to a particular set of variables S. There is an edge Y → X in the causal graph of a SEM when Y is a direct cause of X relative to the variables in the SEM even if Y is not a cause of X.
3.2.1
Total Effects in SEM C
Adapting Lauritzen's notation [Lauritzen, 2001], the mean of Y when W is manipulated to the constant value m will be written as E(Y||W = m), where the double bar distinguishes this operation from the single bar used to denote conditioning. E_C(θ)(Y||W = m) represents the mean of Y that is entailed by C(θ) when W is manipulated to have the value m, by replacing the structural equation for W in C(θ) with the new structural equation W := m.

In the case of linear models, there is an easy way to calculate the total effect of W on Y as long as the causal graph is acyclic. The total effect of W on Y along a given directed path from W to Y is simply the product of the edge coefficients along the directed path. For example, in SEM C, the effect of W on Y along the directed path Y ← Z ← W is b_YZ · b_ZW. The total effect of W on Y is the sum over all directed paths from W to Y of the effect of W on Y along each directed path (the directed path sum). Since in SEM C there is only one directed path from W to Y, the total effect of W on Y is also b_YZ · b_ZW. (The manipulated mean of Y when W is manipulated to m, given that the pre-manipulation mean of Y is zero, is then equal to m times the total effect, or m · b_YZ · b_ZW.) One intuitive consequence of this method of calculating manipulated means is that if Y is not a cause of W (i.e. there is no directed path from Y to W in the causal graph), then the manipulated mean and the unmanipulated mean of W are the same.

In SEM C, E_C(θ)(Y||W = m) = m · b_YZ · b_ZW = E_C(θ)(Y|W = m). This follows from the fact that cov_C(θ)(Y, W) (used to calculate the conditional mean) is equal to the trek sum, and the total effect (used to calculate a manipulated mean) is equal to a directed path sum. However, in SEM C, the treks and the directed paths between Y and W are the same, i.e. Y ← Z ← W. Hence the covariance and the total effect are the same, and the conditional and manipulated means are the same. In contrast, by assumption E_C(θ)(Y) = 0; so in this case, both the manipulated mean (E_C(θ)(Y||W = m)) and the conditional mean (E_C(θ)(Y|W = m)) are different from E_C(θ)(Y). This is intuitively correct, since according to C(θ), manipulating the value of W (doing) changes the value of Y because it is a cause of Y, and conditioning on (seeing) W contains information about the value of Y.

3.2.2
Total Effects in SEM B
Suppose that SEM B is the alternative SEM of Figure 2. Let Σ be the covariance matrix over the observed variables X, Y, Z, W . It can be shown that for every θ of SEM C, there is a θ′ of SEM B such that ΣB(θ′ ) = ΣC(θ) , i.e. for every assignment θ of values to the free parameters of SEM C there is an assignment θ′ of values to the free parameters of SEM B such that the covariance matrix entailed by B(θ′ ) is the same as the covariance matrix entailed by C(θ) over the observed variables. Suppose then that θ′ is chosen so that ΣB(θ′ ) = ΣC(θ) . Intuitively, the common cause T of Y and W contributes to the covariance between Y and W , but there is no causal influence of W on Y because there is no directed path from W to Y . The reason that in SEM B the conditional mean
is not equal to the manipulated mean is that in SEM B there is a trek between W and Y (Y ← Z ← T → W) that is not a directed path. So the trek product over Y ← Z ← T → W contributes to the covariance, and hence the conditional mean, but it does not contribute to the total effect of W on Y, and hence does not contribute to the manipulated mean. Slightly more formally, E_B(θ′)(Y|W = m) = m · cov_B(θ′)(Y, W), which is equal to m times the sum of trek products between W and Y. Y ← Z ← T → W is a trek in B, and so E_B(θ′)(Y|W = m) = m · b_YZ · b_ZT · b_WT. In contrast, the total effect of W on Y according to B is the sum of directed path products from W to Y. However, there are no directed paths from W to Y in B, so the total effect is zero. In summary, E_B(θ′)(Y||W = m) = E_B(θ′)(Y) = 0 ≠ E_B(θ′)(Y|W = m). One lesson to be drawn is that reliably estimating the quantitative effect of one variable on another (e.g. W on Y) requires first knowing the qualitative causal relationships of the variables represented by the graph (e.g. whether W causes Y as in SEM C, or W does not cause Y as in SEM B).
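The contrast between conditioning and manipulating in SEM B can be simulated directly. In the sketch below (not part of the original text; the single coefficient value 0.4 for every edge is a hypothetical choice), manipulating W is implemented exactly as described above: the structural equation for W is replaced, which cuts the edge from T.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500_000
    b = 0.4   # hypothetical value used for every structural coefficient
    m = 1.0

    def sem_b(do_w=None):
        """SEM B: X -> Y <- Z <- T -> W; do_w replaces W's equation."""
        T = rng.normal(0.0, 1.0, n)
        if do_w is None:
            W = b * T + rng.normal(0.0, np.sqrt(1 - b**2), n)
        else:
            W = np.full(n, do_w)           # W := m, edge T -> W removed
        Z = b * T + rng.normal(0.0, np.sqrt(1 - b**2), n)
        X = rng.normal(0.0, 1.0, n)
        Y = b * X + b * Z + rng.normal(0.0, np.sqrt(1 - 2 * b**2), n)
        return W, Y

    W, Y = sem_b()
    print(Y[np.abs(W - m) < 0.05].mean())   # seeing: ~ m * b^3 = 0.064
    _, Y_do = sem_b(do_w=m)
    print(Y_do.mean())                      # doing: ~ 0, W is not a cause of Y

Replacing the equation for W while leaving the other equations untouched is precisely the manipulated model B(θ′) described in the text.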
3.2.3
Regression and Structural Coefficients
One way to predict the value of Y from a set of variables O is to regress Y on O. For example, suppose that given a covariance matrix Σ the goal is to predict the value of Y after having observed values of O = {X, Z, W}. Regression is a procedure that takes Σ as input and outputs a linear equation "Ŷ = r_YZ.O · Z + r_YX.O · X + r_YW.O · W", where Ŷ is the predicted (estimated) value of Y from the values of X, Z, and W, and r_YZ.O is a real number that is the partial regression coefficient of Z when Y is regressed on O. It can be shown that if X = x, Z = z, and W = w then E(Ŷ) = r_YZ.O · z + r_YX.O · x + r_YW.O · w = E(Y|X = x, Z = z, W = w). So regression gives an unbiased estimate of the conditional mean.[11] The output of the regression depends only on Σ, and is unaffected by how the variables are causally related. Under certain assumptions about the causal relations between X, Y, Z, and W, the partial regression coefficients are also equal to linear structural coefficients associated with edges in a causal graph of a SEM [Spirtes et al., 1998b]. In those cases, the regression coefficients can be used not only to estimate the conditional mean, they can also be used to estimate manipulated means.

[11] An unbiased estimator is defined in the Appendix. The regression equation gives the best unbiased linear predictor of Y from X, Z, and W.

In a directed graph of SEM C, Z is a descendant of Y (and Y is an ancestor of Z) if there is a directed path from Y to Z. ND(C, Y) is the set of all non-descendants of Y in the graph of SEM C, other than Y itself. For example, ND(C, Y) = {X, W, Z} and ND(C, X) = {W, Z}. In a SEM C over a causally sufficient set of variables (without correlated errors and with a graph that is acyclic), if X is a non-descendant of Y in the causal graph (i.e. a non-effect of Y) then the structural coefficient b_YX (i.e. the coefficient of X in the structural equation for Y) is equal to r_YX.ND(C,Y),
that is, the coefficient of X when Y is regressed on ND(C, Y).[12] For example, to find b_YX and b_YZ in C(θ), regress Y on X, Z, and W using as input Σ_C(θ). It follows that r_YZ.ND(C,Y) = b_YZ, r_YX.ND(C,Y) = b_YX, and r_YW.ND(C,Y) = b_YW. (In SEM C, b_YW is zero because W is not a parent of Y, i.e. there is no edge W → Y in the graph of C.)

[12] It is only necessary to regress Y on its parents in SEM C for the regression coefficients to equal the structural coefficients. However, in many circumstances, it is not known which variables are parents of Y, but it is known which variables are non-descendants of Y (e.g. all variables that occur earlier than Y). When the regression coefficients are equal to the structural coefficients, the regression residuals (the difference between the predicted and actual values of Y) can be interpreted as due to unmeasured causes of Y, and hence are equal to error terms under a causal interpretation.

The use of regression coefficients to predict Y from conditioning (seeing) values of other variables does not require knowledge of causal relations. In contrast, the use of regression coefficients to estimate structural coefficients (and hence to predict the effects of manipulations) does require knowledge of causal relations between the variables; i.e. it requires knowing which variables are not effects of Y, and that the set of variables is causally sufficient. Suppose first that J in Figure 3 is the true SEM.
[Figure 3, garbled in extraction, shows two SEMs over Atmospheric Pressure, Barometer Reading, and Rainfall: in J, Barometer Reading → Rainfall (with coefficient b); in K, Atmospheric Pressure is a common cause of Barometer Reading and Rainfall, and Barometer Reading has no effect on Rainfall.]

Figure 3. SEMs J and K

According to J, the total effect of Barometer Reading on Rainfall is the directed path sum and is equal to the structural coefficient b, which is the coefficient resulting from regressing Rainfall on Barometer Reading. If SEM K is true, then the total effect of Barometer Reading on Rainfall is equal to the coefficient of Barometer Reading in the structural equation for Rainfall (which is zero) and is equal to the coefficient resulting from regressing Rainfall on Barometer Reading and Atmospheric Pressure. Without knowing which of J or K is true, it is not possible to determine which regression gives the true direct (and in this case total) effect. Unfortunately, given just the covariance between Barometer Reading and Rainfall, there is no way to determine which of J or K is true. And if K is true, and Atmospheric Pressure is unmeasured, then there is no way to perform the regression of Rainfall on Barometer Reading and Atmospheric Pressure. This problem would not arise if it were known that there are no unobserved common causes of Rainfall and Barometer Reading; in that case
the only SEM compatible with background assumptions (about time order) is J, and the total effect of Barometer Reading on Rainfall can be reliably estimated.[13]

One practice that causal modelers in the social sciences use to try to partially solve this problem is to observe as many potential common causes of Rainfall and Barometer Reading as possible (e.g. variables that occur prior to Rainfall and Barometer Reading), and regress on all of them. It is well known that regressing on many variables can create various statistical problems at small sample sizes, but at large enough sample sizes, and assuming that there are no deterministic relationships among the measured variables, these statistical problems can be overcome. Unfortunately, it is generally not possible to reliably test when all of the common causes of Rainfall and Barometer Reading have been observed (although with some additional assumptions, in some cases the FCI algorithm described in section 4.2.3 can reliably test this). In addition, regressing on additional potential common causes can actually make the difference between the structural coefficient and the regression coefficient larger [Spirtes et al., 1998b]. For example, in Figure 4, if Z is temporally prior to Y, simple calculations show b_YX = r_YX ≠ r_YX.Z even though Z is a potential common cause. This is because {X, Y} is causally sufficient, but {X, Y, Z} is not.
[Figure 4, garbled in extraction: a causal graph over X, Y, and Z with unobserved common causes T1 and T2.]
Figure 4. Adding more variables to regression can make estimates more biased
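The following sketch (added here; not from the original chapter) simulates a graph of the kind shown in Figure 4 (the assumed edges are T1 → X, T1 → Z, T2 → Z, T2 → Y, and X → Y, with hypothetical coefficients) and compares the regression coefficient of X with and without the extra covariate Z.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000
    b_YX = 0.3   # the structural coefficient to be estimated

    # assumed structure: T1 -> X, T1 -> Z, T2 -> Z, T2 -> Y, X -> Y
    T1 = rng.normal(0.0, 1.0, n)
    T2 = rng.normal(0.0, 1.0, n)
    X = 0.7 * T1 + rng.normal(0.0, 1.0, n)
    Z = 0.7 * T1 + 0.7 * T2 + rng.normal(0.0, 1.0, n)
    Y = b_YX * X + 0.7 * T2 + rng.normal(0.0, 1.0, n)

    def ols(y, *regressors):
        """Least-squares coefficients of y on the given regressors."""
        return np.linalg.lstsq(np.column_stack(regressors), y, rcond=None)[0]

    print(ols(Y, X)[0])      # ~ 0.3 = b_YX: regressing Y on X alone is unbiased
    print(ols(Y, X, Z)[0])   # ~ 0.21: adding Z biases the estimate

Here {X, Y} is causally sufficient but {X, Y, Z} is not; conditioning on the collider Z opens the path X ← T1 → Z ← T2 → Y, which biases the coefficient of X.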
4
ESTIMATING MANIPULATED MEANS
One goal of causal inference is to obtain a qualitative understanding of the causal relationships between variables, and how a population distribution is generated. In that case, inference of the causal graph, or as much about the causal graph as can be reliably discovered, is the goal. A second goal is to obtain quantitative information about causal relations between variables, e.g. if W is manipulated to m, what will the mean of Y become? Estimates of quantitative causal relations typically require first inferring qualitative causal relationships, so both issues will be discussed together.

[13] In general, if a set of variables in a SEM G is causally sufficient, the structural coefficient b_YX is equal to the partial regression coefficient r_YX.O, for any set O that contains X, as long as O does not contain any descendant of Y, and X is d-separated from Y conditional on O in the graph of G with the X → Y edge removed [Spirtes et al., 1998b]. D-separation is explained in section 4.1.1.
The following discussion concentrates on estimating[14] manipulated means, but similar strategies can be used to estimate other causal quantities. As discussed in section 3.2.1, E_C(θ)(Y||W = m) = m · b_YZ · b_ZW, while E_B(θ′)(Y||W = m) = 0. So the mean of Y when W is manipulated to m is a function of which SEM is true, and of the values of the parameters in the SEM. One way to estimate the manipulated mean, which will be discussed in detail in this section, is to (i) use the sample to find the correct SEM (e.g. SEM C), (ii) find a formula for E_C(θ)(Y||W = m) in terms of the free parameters of the correct SEM (e.g. m · b_YZ · b_ZW), (iii) estimate the values of the free parameters of the SEM from the data (e.g. b̂_YZ and b̂_ZW), and (iv) substitute the estimated values into the formula for the manipulated mean (e.g. m · b̂_YZ · b̂_ZW). The actual process is a little more complicated because without a known time order, and with the possibility of unmeasured common causes, typically it is not possible to reliably find the true SEM; rather it is only possible to reliably find a set of SEMs that contains the true SEM.

[14] See section 5 for a technical appendix defining estimators and their properties.

The principles of estimating the effects of manipulations can be illustrated with the five causal graphs of Figure 2. The first step is to search among the SEMs and select the best SEM or set of SEMs. In order to use sample data to select SEMs, it is necessary to make some assumptions that link SEMs (and in particular the graphs of SEMs) to probability distributions. The Causal Markov Assumption is implicit in much of the practice of structural equation modeling. (For extended discussions of the Causal Markov Assumption, see [Spirtes et al., 2001; Cartwright, 1994; Glymour, 1999].)
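As a sketch of steps (i)-(iv) above (added here, not part of the original text; it simply assumes that SEM C has already been identified as correct and that a sample from it is available), the free parameters are estimated by regressing each variable on its parents and then substituted into the formula for the manipulated mean.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100_000
    m = 2.0

    # stand-in for observed data: a sample generated from SEM C
    W = rng.normal(0.0, 1.0, n)
    Z = 0.4 * W + rng.normal(0.0, np.sqrt(0.84), n)
    X = rng.normal(0.0, 1.0, n)
    Y = 0.4 * X + 0.4 * Z + rng.normal(0.0, np.sqrt(0.68), n)

    # (i)-(ii): with SEM C taken as correct, E(Y || W = m) = m * b_YZ * b_ZW
    # (iii): estimate the free parameters by regressing variables on parents
    b_ZW_hat = np.linalg.lstsq(W[:, None], Z, rcond=None)[0][0]
    b_YZ_hat = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][1]

    # (iv): substitute the estimates into the formula
    print(m * b_YZ_hat * b_ZW_hat)   # ~ 2 * 0.4 * 0.4 = 0.32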
4.1.1
Causal Markov Assumption
X and Y are independent in a probability density function f, denoted I_f(X, Y), when f(X|Y) = f(X). (The f subscript will be left out of I_f when it does not introduce any ambiguity.) Intuitively, this means that the values of the variables in Y contain no information about the values of the variables in X. In the case of Gaussian variables, X and Y are independent if and only if for each variable X ∈ X and Y ∈ Y, cov(X, Y) = 0. Similarly, X is independent of Y conditional on Z (denoted I_f(X, Y|Z)) when f(X|Y, Z) = f(X|Z). Intuitively, this means that the values of the variables in Y contain no information about the values of the variables in X, once the values of the variables in Z are already known. In the case of a Gaussian distribution f, X and Y are independent conditional on Z if and only if for each variable X ∈ X and Y ∈ Y, cov_f(X, Y|Z) = 0 (where cov_f(X, Y|Z) is the covariance between X and Y conditional on Z in distribution f). For Gaussian distributions, the conditional covariance does not depend upon the particular value of Z conditioned on (although the conditional mean does).

I_S(θ)(X, Y|Z) denotes that X is independent of Y conditional on Z in the density over the substantive variables entailed by S(θ). I_S(X, Y|Z) denotes that a SEM S (with free parameters) entails I_S(θ)(X, Y|Z) for all assignments of values θ to its free parameters; in other words, X is independent of Y conditional on Z in
791
every distribution represented by S. If IS (X, Y|Z) then S is said to entail that X and Y are independent conditional on Z. Since IS (X, Y|Z) does not depend upon the parameter values of S, it is possible to determine whether IS (X, Y|Z) from the graph of S; in this case the graph of S is also said to entail that X and Y are independent conditional on Z.15 (In cases where it does not create confusion, the graph of S will simply be referred to as S). It is possible that I(θ) (X, Y|Z) for some, but not all θ (see an example in section 4.1.3). In general the fact that it is not the case that IS (X, Y|Z) does not mean that there is no assignment of values θ to the free parameters for which IS(θ) (X, Y|Z)) — it entails only that it is not the case that IS(θ) (X, Y|Z)) for all θ. The following assumption is used to relate causal relations to probability distributions. Weak Causal Markov Assumption: For a causally sufficient set of variables V in a population N , if no variable in X causes any variable in Y, and no variable in Y causes any variable in X, then X and Y are independent (i.e. in the Gaussian case, members of X and members of Y are pairwise uncorrelated.) The Weak Causal Markov Assumption has the consequence that the error terms for causally sufficient sets of variables are independent. Simon’s famous analysis of “spurious correlation” [Simon, 1985] is precisely an application of the Weak Causal Markov Assumption to explain correlated errors as the result of unobserved common causes. The examples that Bollen gives of why an error term for a variable X might be correlated with one of the causes of X other than sampling problems are all due to causal relations between the error term and other causes of X, and hence an application of the Weak Causal Markov Assumption [Bollen, 1989]. (For a discussion of the Causal Markov Assumption, and conditions under which it should not be assumed, see [Spirtes et al., 2001]). For deterministic causal models such as SEMs, the Weak Causal Markov Assumption also entails another version of the Causal Markov Assumption, i.e. that for causally sufficient sets of variables, all variables are independent of the their non-effects (non-descendants in the causal graph) conditional on their direct causes (parents in the causal graph) [Spirtes et al., 2001].16 Some of the causal discovery algorithms described later that use the Causal Markov Assumption merely assume that causally sufficient sets of variables exist, not that the causally sufficient sets of variables are all observed, i.e. there can be unobserved common causes. The SEMs in Figure 2 illustrate the use of this 15 In acyclic graphs with no double-headed arrows, X is d-separated from Y conditional on Z is a purely graphical relationship among three disjoint sets of variables X, Y, and Z such that X is d-separated from Y conditional on Z in DAG G if and only if G entails that X and Y are independent conditional on Z. A variable A is a collider on a path if the path contains edges B → A ← C. For disjoint sets X, Y, and Z, X is d-separated from Y conditional on Z in G if and only if there is no path between any X ∈ X and any Y ∈ Y such that every collider on the path is in Z or has a descendant in Z, and no non-collider on the path is in Z. (See [Pearl, 1988] for details). d-separation can also be extended to graphs with cycles and double-headed arrows. 16 For non-detertministic causal models, the alternative Causal Markov Assumption is usually made directly. 
(For different alternative versions of Markov relations, see [Lauritzen et al., 1990]).
consequence of the Weak Causal Markov Assumption in causal inference. Suppose that SEM C of Figure 2 has the true causal graph. In SEM C, Cov_C(W, Z) = b_ZW, so ∼I(W, Z) as long as b_ZW ≠ 0. In contrast, I_A(W, Z), so A cannot represent the population distribution at all, regardless of what parameter values it is assigned. So, even for the assignment θ of parameters to A that most closely matches the population distribution, a statistical test will reject A(θ) at a large enough sample size. In contrast, it is not possible to eliminate SEM D from consideration using just the Causal Markov Assumption as long as T is not observed. If SEM D is true, the Causal Markov Assumption does not entail any zero covariances or conditional covariances among X, Y, W, and Z. However, just because there are no zero conditional covariances among X, Y, W, and Z that hold for all values of the free parameters of D does not imply that there are no zero conditional covariance relations among X, Y, W, and Z that hold for some values of the free parameters of D. Indeed, for any assignment of parameter values θ to C there exists an assignment of parameter values θ′ to D that represents the same distribution over the observed variables X, Y, Z, and W.17 Hence, at least for some parametric families (e.g. Gaussian, multinomial), in order to reliably draw conclusions about the effects of manipulations from observational data, some additional assumptions must be made. (This is shown more formally in [Spirtes et al., 2001; Robins et al., 2003].) The next section will describe one such assumption.
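Whether a graph entails I_S(X, Y | Z) can be checked mechanically via the d-separation relation of footnote 15. The following minimal Python sketch uses the standard characterization of d-separation as separation in the moralized ancestral graph; the helper name d_separated is illustrative, the networkx library is assumed, and since Figure 2 is not reproduced here, the chain X ← Y ← Z ← W described for SEM C in section 4.2 serves as the example.

```python
import networkx as nx

def d_separated(G, X, Y, Z):
    """Return True if X and Y are d-separated by Z in the DAG G.

    Uses the ancestral-moralization characterization: X and Y are
    d-separated by Z iff they are separated in the moral graph of the
    subgraph induced by the ancestors of X, Y, and Z.
    """
    relevant = set().union(X, Y, Z)
    anc = set(relevant)
    for v in relevant:
        anc |= nx.ancestors(G, v)
    H = nx.Graph(G.subgraph(anc).to_undirected())
    # Moralize: "marry" parents that share a child in the subgraph.
    for child in anc:
        parents = [p for p in G.predecessors(child) if p in anc]
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                H.add_edge(p, q)
    # Remove the conditioning set and test ordinary graph separation.
    H.remove_nodes_from(Z)
    return all(not nx.has_path(H, x, y) for x in X for y in Y)

# The chain X <- Y <- Z <- W used for SEM C in section 4.2:
C = nx.DiGraph([("W", "Z"), ("Z", "Y"), ("Y", "X")])
print(d_separated(C, {"Z"}, {"X"}, {"Y"}))  # True: C entails I(X, Z | Y)
print(d_separated(C, {"Z"}, {"X"}, set()))  # False: X and Z are d-connected
```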
4.1.2 Causal Faithfulness Assumption
In any SEM over a causally sufficient set of variables, there is an adjacency between two variables X and Y if and only if the SEM does not entail that X and Y are independent conditional on any subset Z of the other variables [Spirtes et al., 2001]. Even though SEM C entails that X and Z are independent, and SEM D does not (because SEM D contains an edge from Z to X), nevertheless there could be an assignment of parameter values to SEM D that has the same distribution as C(θ). That is, it could be that I_{D(θ′)}(X, Z) for some θ′ (i.e. Cov_{D(θ′)}(X, Z) = 0) even though it is not the case that I_D(X, Z) (i.e. it is not the case that D entails Cov(X, Z) = 0 for all values of the free parameters). Cov_{D(θ′)}(X, Z) = β_XZ + (β_YZ · β_XY), so if Cov_{D(θ′)}(X, Z) = 0, then β_XZ = −(β_YZ · β_XY). Although C and D agree on the total effect of Z on X (zero according to both), they disagree on the total effect of Y on X (zero according to SEM C, β_XY according to SEM D). So if it is observed that I(X, Z), there are at least two possible explanations, SEMs C and D. There are several arguments why, in the absence of evidence to the contrary, C should be the preferred explanation. 17 However, if at most one error term is Gaussian, but the structural equations are linear, then SEM D cannot represent any distribution represented by SEM C, and hence can be eliminated from consideration without any further assumptions [Shimizu et al., 2006].
SEM C explains the independence of X and Z as a consequence of no causal connection between the variables. In contrast, SEM D explains the independence as a consequence of a large direct effect of Z on X cancelled exactly by a large indirect effect of Z on X (via the effect on Y). But science typically assumes that, unless there is evidence to the contrary, an improbable and unstable cancellation of parameters (as in SEM D) does not hide real causal influences. When a theory cannot explain an empirical regularity save by invoking a special parameterization, most scientists are uneasy with the theory and look for an alternative [Glymour, 1980]. Second, this cancellation is improbable (in the sense that if a zero conditional correlation is not entailed, the measure of the set of free parameter values for any DAG that lead to such cancellations is zero for any "smooth" prior probability distribution,18 e.g. Normal, exponential, etc., over the free parameters). Finally, SEM C is simpler than SEM D. SEM C has fewer free parameters than SEM D. Because SEM D imposes a proper subset of the conditional independence constraints imposed by SEM C, SEM D represents a proper superset of the distributions that can be represented by SEM C. There is even a well-defined sense in which the set of distributions represented by SEM D is of higher dimension than the set of distributions represented by SEM C [Geiger et al., 2002]. So SEM D is more complex than SEM C in a precisely defined way. The assumption that a causal influence is not hidden by coincidental cancellations can be expressed as follows for SEMs. A probability density function f is faithful to the graph G of a SEM if and only if every conditional independence relation true in f is entailed by G.

Causal Faithfulness Assumption: For a causally sufficient set of variables V in a population N, the population distribution is faithful to the causal graph for N.19

The Causal Faithfulness Assumption requires preferring SEM C to SEM D, because parameter values θ′ for which I_{D(θ′)}(X, Z) would violate the Causal Faithfulness Assumption. The Causal Faithfulness Assumption limits the SEMs considered to those SEMs in which population conditional independence constraints are entailed by the causal structure, rather than by particular values of the parameters. There can sometimes be good reasons to believe that the more complicated model is true, and that the reason a conditional independence relation holds is not the structure of the graph, e.g. when there are deterministic relationships among the substantive variables, equality constraints upon free parameters [Spirtes et al., 2001], or cases where policy makers are intentionally optimizing
the values of variables [Hoover, 2001]. In those cases, the Causal Faithfulness Assumption should not be made.

18 A smooth measure is absolutely continuous with Lebesgue measure.

19 The assumption can be considerably weakened and still permit reliable causal inference, but the weaker assumption requires more complicated algorithms with more complex and somewhat less informative output. The weakened assumption is closer to the informal intuition. The Adjacency Causal Faithfulness Assumption states: For a causally sufficient set of variables S, if X is a direct cause of Y (so X and Y are adjacent in the causal graph for population N), then for any Z ⊆ S, ∼I(X, Y | Z) in population N [Ramsey et al., 2006].
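The cancellation β_XZ = −(β_YZ · β_XY) described above is easy to exhibit numerically. The Python sketch below simulates only the Z, Y, X fragment of SEM D (Figure 2 is not reproduced here, and the coefficient values are arbitrary illustrations); it shows the marginal covariance of X and Z vanishing despite a direct Z → X effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta_YZ, beta_XY = 0.8, 0.5
beta_XZ = -(beta_YZ * beta_XY)   # the unfaithful, cancelling choice

Z = rng.normal(size=n)
Y = beta_YZ * Z + rng.normal(size=n)
X = beta_XZ * Z + beta_XY * Y + rng.normal(size=n)

# Cov(X, Z) = beta_XZ + beta_YZ * beta_XY = 0, despite the Z -> X edge.
print(np.cov(X, Z)[0, 1])   # approximately 0
```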
4.1.3 Markov Equivalence-Over-O Classes
According to SEM B, I_B(Z, W | T). However, it is not the case that I_C(Z, W | T) because SEM C does not contain the unobserved variable T. This conditional independence is not directly testable, because T is not observed. However, SEMs B and C entail the same set of conditional independence relations among just the observed variables O = {X, Y, Z, W}. SEMs B and C are Markov equivalent-over-O if they entail the same set of conditional independence relations among the variables in O. Hence when O = {X, Y, Z, W} the Markov equivalence-over-O class contains SEM B as well as SEM C. It is easy to see that the Markov equivalence-over-O class contains an infinite number of SEMs, because adding more and more unobserved common causes of X and Y does not change the conditional independence relations entailed over O. (If two SEMs both have the same set of substantive variables V (observed or unobserved), and they are Markov equivalent-over-V, they are said to be Markov-equivalent.) If SEM C is true, SEM A can be eliminated from consideration by the Causal Markov Assumption, and SEM D can be eliminated from consideration by the Causal Faithfulness Assumption. However, neither of these assumptions eliminates SEM B from consideration. Since SEM B and SEM C entail the same set of conditional independence relations over O, it is not possible to eliminate SEM B from consideration without either adding more assumptions or background knowledge, or using features of the probability distribution that are not conditional independence relations. Without a principled way to choose between SEMs B and C, the best a search algorithm can reliably do is to return both SEM B and SEM C, as well as the other graphs Markov equivalent to SEMs B and C, as possibilities. SEMs B and C are distribution equivalent-over-O if and only if for any assignment of parameter values θ to C there exists an assignment of parameter values θ′ to B that entails the same marginal distribution over O, and vice versa. (If two SEMs have the same set of substantive variables V (observed or unobserved) and they are distribution equivalent-over-V, they are said to be distribution-equivalent.)20 If all of the error terms are Gaussian, then SEMs B and C are distribution equivalent-over-O as well as Markov equivalent-over-O. In such cases, the best that a reliable search algorithm can do is to return the entire Markov equivalence-over-O class, regardless of what features of the marginal distribution it uses.
20 Strictly speaking, distribution equivalence-over-O is relative to a particular parametric family. So two SEMs that are distribution-equivalent-over-O for a parametric family that allows only Gaussian error terms are not distribution-equivalent-over-O for a parametric family that allows non-Gaussian error terms.
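For small graphs, Markov equivalence-over-O can be checked by brute force: enumerate, for every pair of observed variables and every conditioning subset of O, whether d-separation holds, and compare the resulting sets. The sketch below reuses the d_separated helper from section 4.1.1; since Figure 2 is not reproduced here, SEM B is assumed, for illustration, to be SEM C with the W → Z edge replaced by a latent common cause T, which is consistent with I_B(Z, W | T).

```python
from itertools import combinations
import networkx as nx

def entailed_cis(G, O):
    """All (x, y, Z) with x, y in O and Z a subset of O, d-separated in G.
    Assumes the d_separated helper sketched in section 4.1.1."""
    cis = set()
    for x, y in combinations(sorted(O), 2):
        rest = sorted(O - {x, y})
        for k in range(len(rest) + 1):
            for Z in combinations(rest, k):
                if d_separated(G, {x}, {y}, set(Z)):
                    cis.add((x, y, Z))
    return cis

O = {"X", "Y", "Z", "W"}
C = nx.DiGraph([("W", "Z"), ("Z", "Y"), ("Y", "X")])
B = nx.DiGraph([("T", "Z"), ("T", "W"), ("Z", "Y"), ("Y", "X")])
print(entailed_cis(C, O) == entailed_cis(B, O))  # True: equivalent over O
```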
In contrast, SEMs D and E are Markov equivalent-over-O, but they are not distribution equivalent-over-O. SEM E, unlike SEM D, entails that Cov_E(X, Z) · Cov_E(Y, W) = Cov_E(X, W) · Cov_E(Y, Z), and that Cov_E(X, Z), Cov_E(Y, W), Cov_E(X, W), and Cov_E(Y, Z) cannot all be negative (this follows easily from the trek sum rule for calculating covariances). This is a constraint on the marginal distribution over O that is not a conditional independence constraint. So when unobserved common causes are allowed, SEMs that are Markov equivalent-over-O are not necessarily distribution equivalent-over-O even when all of the error terms are Gaussian and the structural equations are linear. When Markov equivalence-over-O fails to entail distribution equivalence-over-O, then using conditional independence relations alone for causal inference is still correct, but it is not as informative as theoretically possible. For example, assuming causal sufficiency and non-Gaussian errors [Shimizu et al., 2006], conditional independence tests can at best reliably determine the correct Markov equivalence class, while other features of the sample distribution (higher order moments of the distribution) can be used to reliably determine a unique graph. The situation is analogous for linear SEMs and Gaussian errors, but without causal sufficiency. Using conditional independence tests it is possible to reliably infer a Markov equivalence-over-O class of DAGs; however, because there are other features of the distribution over O that are not being used, this set is theoretically not as small as possible. For example, if Cov_E(X, Z) · Cov_E(Y, W) ≠ Cov_E(X, W) · Cov_E(Y, Z), or the covariances are all negative, then SEM E can be eliminated from consideration, even if all of the conditional independence relations entailed by E (none in this case) are true in the population. The difficulty is that in many cases it is not known how to use the extra information contained in the density to reliably narrow down the set of DAGs output. However, in some cases it is known how to use the extra information (see section 4.2.4). There are algorithms for testing when two SEMs are Markov equivalent-over-O [Spirtes and Richardson, 1996], although they are computationally much more intensive than comparable algorithms for testing when two SEMs without unobserved common causes are Markov equivalent. While there are recently developed algorithms for determining when two SEMs are distribution equivalent-over-O, they are so computationally intensive that they are practical only for SEMs with a few variables [Geiger and Meek, 1999].
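The trek-rule covariance calculations invoked in this section can be checked numerically with the standard implied-covariance formula for linear SEMs: if X = BX + e with error covariance Ω, then Σ = (I − B)⁻¹ Ω (I − B)⁻ᵀ. In the Python sketch below, the loadings are arbitrary and the single-latent-cause structure is a stand-in for SEM E (Figure 2 is not reproduced here); a single latent common cause of the four observed variables is what generates the tetrad constraint quoted above.

```python
import numpy as np

def implied_covariance(B, Omega):
    """Sigma for the linear SEM X = B X + e, with Cov(e) = Omega."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ Omega @ A.T

# Variable order: T, X, Y, Z, W; one latent common cause T of the rest.
B = np.zeros((5, 5))
B[1:, 0] = [0.9, 0.7, -0.6, 0.8]          # arbitrary loadings on T
Sigma = implied_covariance(B, np.eye(5))
X, Y, Z, W = 1, 2, 3, 4
lhs = Sigma[X, Z] * Sigma[Y, W]
rhs = Sigma[X, W] * Sigma[Y, Z]
print(np.isclose(lhs, rhs))               # True: the tetrad constraint holds
```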
4.1.4 Partial Ancestral Graphs — Representing Markov Equivalence-over-O Classes
A partial ancestral graph (PAG) [Spirtes et al., 1995; Ali et al., 2005; Zhang, 2007] represents a Markov equivalence-over-O class of DAGs. X and Y are adjacent in a PAG that represents a Markov equivalence-over-O class if X and Y are not entailed to be independent conditional on any subset of O. The edge contains an "o" endpoint at the X end if for some DAGs in the Markov equivalence-over-O class X is an ancestor of Y, and in other DAGs it is not an ancestor of Y; the edge contains a tail at X if and only if X is an ancestor of Y in every DAG in the Markov equivalence-over-O class; and the edge contains an arrowhead at X if and
!"!# # !"$#"#############%"# #
&%#
# '(!
only if X is not an ancestor of Y in every DAG in the Markov equivalence-over-O class.

Figure 5. College Plans PAG

An example of a PAG that was constructed from data measured on the college plans of high school students is shown in Figure 5.21 SES is socio-economic status, PE is parental encouragement, CP is college plans, and IQ is intelligence quotient [Spirtes et al., 2001; Sewell and Shah, 1968]. The double-headed arrow between CP and SES indicates that there is an unobserved common cause of SES and CP. The PE → CP arrow indicates that PE is a direct cause (relative to the observed variables) of CP, and that there are no unobserved common causes of PE and CP. The edge SEX o→ PE indicates that PE is not a cause of SEX, but the statistical information is compatible with SEX being a cause of PE, or with there being an unobserved common cause of SEX and PE.

21 This model is not a SEM because the variables were discretized, but the causal relations can be represented by a PAG and the PAG can be created with the same FCI algorithm (section 4.2.3) using the appropriate tests for conditional independence for discrete variables.
4.1.5 Scoring SEMs
Together, the Causal Markov and Causal Faithfulness Assumptions entail a number of conditional independence facts about the population distribution. However, even if the conditional independence relations hold in the population, they typically will not hold in the sample. For a sample, what is a good measure of how compatible the set of conditional independence relations entailed by a SEM is with the sample data? A similar question can be asked of those parametric families that entail constraints that are not conditional independence constraints. Under the assumption of no unmeasured common causes, there are a number of different scores of fit that are consistent in the following sense: under the Causal Markov and Causal Faithfulness Assumptions, in the large sample limit, the set of SEMs with the highest score is guaranteed to contain the true SEM with probability 1. For example, the Bayes Information Criterion is a penalized likelihood score that rewards a SEM for assigning a high likelihood to the sample
for the maximum likelihood estimate of the parameters, and penalizes a SEM that represents a set of probability distributions that has high dimension [Haughton, 1988; Chickering, 2003]. In the case of a multivariate Gaussian SEM M, for a given sample of size n,

BIC(M, sample) = −2L(Σ_M(θ̂), sample) + ln(n) · df_M, where

• θ̂ is the maximum likelihood estimate of the parameters for model M from the sample;
• L(Σ_M(θ̂), sample) is the likelihood of Σ_M(θ̂); and
• df_M is the degrees of freedom (dimensionality) of the SEM M.

While the use of penalized likelihood scores, such as BIC, for SEMs with no unobserved common causes is not problematic, there are major statistical problems in scoring models using penalized likelihood scores for SEMs with unobserved common causes. In order to calculate a BIC score it is necessary to calculate a maximum likelihood estimate of the SEM parameters, and the dimensionality of the set of marginal distributions over the observed variables represented by the SEM. However, with the exception of a few distributions such as the Gaussian or multinomial, even when the joint distribution falls into a family of distributions that is well understood, the marginal distributions will not. In cases where the marginal distribution is a member of a well understood family of distributions, the parameters of the SEM may not be identifiable at all, and hence it is not possible to find a maximum likelihood estimate of the parameters from the data over the observed marginal. Furthermore, even in those cases where the maximum likelihood estimates can be calculated (such as SEM E in Figure 2) the actual calculations typically involve an iterative hill-climbing algorithm that is much more computationally expensive than regression and can get stuck in local maxima. In addition, there are both theoretical and practical difficulties in calculating the dimensionality of the set of marginal probability distributions that are represented by a SEM with unobserved common causes. The dimensionality is not well defined for some values of the parameters, and is difficult to calculate even when it is well defined [Geiger et al., 2002]. This problem is caused by the fact that while the unobserved conditional independence relations (those that involve the unobserved common cause) cannot be tested directly, they can nevertheless entail constraints on the marginal distribution that are not conditional independence relations. These non-conditional independence constraints present both a problem and an opportunity. On the one hand, if it is known how to use them to eliminate some graphs from consideration (as in certain special cases) then they strengthen the causal inferences that can be made. For example, if Cov(X, Z) · Cov(Y, W) ≠ Cov(X, W) · Cov(Y, Z), or the covariances are
all negative, then SEM E cannot be the correct SEM, even if it is compatible with all of the conditional independence relations over O. On the other hand, the non-conditional independence constraints are the reason that maximum likelihood estimation can get stuck in local maxima and that the dimensionality of the marginal distributions represented by a SEM with unobserved common causes is sometimes undefined or difficult to calculate.
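For the linear Gaussian case without latent variables, the BIC score defined above is straightforward to compute once the model-implied covariance matrix at the maximum likelihood estimate is in hand. A minimal Python sketch, which assumes that Sigma_hat (the implied covariance Σ_M(θ̂)), the sample covariance S, n, and df are supplied by some fitting procedure:

```python
import numpy as np

def gaussian_loglik(Sigma_hat, S, n):
    """Log-likelihood of n observations with sample covariance S under a
    zero-mean Gaussian with covariance Sigma_hat (assumed positive
    definite); S is the maximum-likelihood (divide-by-n) version."""
    p = S.shape[0]
    _, logdet = np.linalg.slogdet(Sigma_hat)
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(Sigma_hat, S)))

def bic(Sigma_hat, S, n, df):
    """BIC(M, sample) = -2 L + ln(n) * df_M, as defined above."""
    return -2.0 * gaussian_loglik(Sigma_hat, S, n) + np.log(n) * df
```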
4.2 Search
One of the major reasons that causal inference is harder than purely statistical inference (estimating a probability distribution or features of a distribution from a sample) is that reliable statistical inference requires finding only one simple model that fits the data well, whereas reliable causal inference requires finding all simple models that fit the data well. Suppose, for example, that SEM C is the true causal SEM. Let G be a SEM that has a DAG that is the same as the DAG of SEM C, but reverses the direction of the Z ← W edge to Z → W, i.e. the DAG for SEM G is X ← Y ← Z → W. SEMs G and C are distribution equivalent (relative to a linear Gaussian parameterization). Because C and G represent the same set of distributions, maximum likelihood estimates of the parameters that characterize the joint distribution are the same. If a search over SEMs found that G was a simple model that fit the data well, and the goal is to use the SEM only for statistical inference, the search can reasonably stop. The fact that some other model such as C is another simple model that fits the data well does not prevent G from providing reliable inference about the joint distribution. In contrast, if a SEM is to be used for causal inference, the fact that there is another SEM C that is as simple as SEM G and fits the data equally well does affect whether G can be reliably used for causal inference. The difference is that SEMs G and C make different predictions about the effects of some (but not all) manipulations. Hence, if a search over SEMs yielded that SEM G was simple and fit the data well, it is not possible to draw the conclusion that SEM G could be used for reliable causal inference. Finding that another SEM C is equally simple and fits the data equally well, but makes different predictions about the effects of some manipulations, indicates that it is not possible to draw reliable conclusions about whether SEM G or SEM C is correct about the effects of the manipulations. On the other hand, if all of the SEMs that are equally simple and fit the data equally well agree about the effect of a manipulation, then the effect of the manipulation can be reliably inferred. This implies that searching for SEMs that are to be used for causal inference requires not just locating one good SEM, but all good SEMs.
4.2.1 Informal Search
In practice, search for causal models with unobserved common causes is often informal, and based on a combination of background assumptions together with statistical tests of the causal models. If a model is rejected by a statistical test, the researcher looks for a modification of the original model that will pass a statistical
test. The search typically halts when a model that is compatible with background assumptions does not fail a statistical test. Often, only the final model is presented, and the search itself is not described. Searches of this kind are only as reliable as the background assumptions, and the extent to which the space of alternatives compatible with the background assumptions was searched. (For an example of a case where a search is described, see [Rodgers and Maranto, 1989].) Rodgers and Maranto show that different disciplines often start from very different causal models, and have different background assumptions, even when investigating the same question. Furthermore, unless the background assumptions are very extensive, or the number of variables is tiny, it is not feasible to estimate and test all of the models compatible with background assumptions. This is further complicated by the fact that for causal inference it is not sufficient to find one model that passes a statistical test; instead it is necessary to find all such models.
4.2.2 Score Based Search
For SEMs that have no unobserved common causes, the Greedy Equivalence Search [Chickering, 2003] returns a Markov equivalence class of SEMs that has the highest score (using one of several possible scores, including BIC). The output Markov equivalence class contains the true SEM with probability 1 in the large sample limit (under the Causal Markov and Faithfulness Assumptions, and assuming either a linear Gaussian model or a multinomial model with an acyclic causal graph). Although in the worst case the running time of the algorithm is exponential in the number of variables, in practice it can be used on graphs with relatively few edges but hundreds of variables. However, a score based search over SEMs with unobserved common causes is much more difficult for two reasons. First, there is an infinite number of SEMs to search, and it is not clear how to structure the search efficiently. Second, scoring a SEM with an unobserved common cause faces both theoretical and practical difficulties. No score-based search that is both feasible and reliable in the large sample limit is known at this time for SEMs with unobserved common causes. The LISREL and EQS programs have automated score-based searches that are versions of the informal search method described in section 4.2.1. However, these searches suffer from the difficulties one would expect from score-based searches over SEMs with unmeasured common causes. The scores depend upon being able to perform maximum likelihood estimates of the parameters, and the iterative algorithms necessary to do this often fail to converge or converge to the wrong values when the initial model is far from the truth. Even in cases in which the parameter estimates do not go awry, the searches have low reliability [Spirtes et al., 1990].
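For concreteness, here is a much-simplified score-based search in the spirit of (but far cruder than) the Greedy Equivalence Search: it greedily adds whichever acyclicity-preserving edge most improves a decomposable BIC-style score, scoring each node by an ordinary least squares regression on its parents. The function names are illustrative, the sketch assumes no unobserved common causes, and a real implementation would search over equivalence classes and also delete and reverse edges.

```python
import numpy as np
import networkx as nx

def local_bic(data, child, parents, n):
    """BIC-style local score of one node given its parents (OLS fit)."""
    X = np.column_stack([data[:, list(parents)], np.ones(n)])
    beta = np.linalg.lstsq(X, data[:, child], rcond=None)[0]
    resid = data[:, child] - X @ beta
    return n * np.log(resid.var()) + np.log(n) * (len(parents) + 1)

def greedy_search(data):
    n, p = data.shape
    G = nx.DiGraph()
    G.add_nodes_from(range(p))
    score = {v: local_bic(data, v, [], n) for v in range(p)}
    while True:
        best = None
        for a in range(p):
            for b in range(p):
                if a == b or G.has_edge(a, b) or nx.has_path(G, b, a):
                    continue   # skip duplicates and cycle-creating edges
                new = local_bic(data, b, list(G.predecessors(b)) + [a], n)
                if best is None or new - score[b] < best[0]:
                    best = (new - score[b], a, b, new)
        if best is None or best[0] >= 0:
            return G           # no single edge addition improves the score
        _, a, b, new = best
        G.add_edge(a, b)
        score[b] = new
```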
4.2.3 Constraint Based Search
In contrast to a score based search, a constraint based search does not require the estimation of parameters, or the calculation of the dimensionality of the marginal
distribution — it requires only being able to perform the appropriate tests of the constraints (e.g. conditional independence constraints) used in the search. Furthermore, although the number of DAGs with unobserved common causes is infinite, the number of Markov equivalence-over-O classes is finite. The Fast Causal Inference (FCI) algorithm [Spirtes et al., 2001; Spirtes et al., 1995; Zhang, 2007] performs a series of conditional independence tests and constructs a PAG on the basis of those tests. In the large sample limit, it returns a PAG that represents a class containing the true SEM with probability 1 under the Causal Markov and Causal Faithfulness Assumptions (for linear Gaussian densities or multinomial distributions). The major differences between using constraint based searches under the assumption of no unobserved common causes, and constraint based searches without that assumption, are that in the latter case the algorithms are computationally more complex, and the output is less informative. The PAG in Figure 5 was constructed by the FCI algorithm [Spirtes et al., 2001] from data on 10,318 high school students [Sewell and Shah, 1968].
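For Gaussian data, the conditional independence tests that FCI and other constraint-based algorithms rely on are typically tests of vanishing partial correlation via the Fisher z transform. A minimal Python sketch; the names ci_test and partial_corr are illustrative, not part of any published implementation.

```python
import numpy as np
from scipy import stats

def partial_corr(S, i, j, cond):
    """Partial correlation of X_i, X_j given X_cond, from covariance S."""
    idx = [i, j] + list(cond)
    P = np.linalg.inv(S[np.ix_(idx, idx)])    # precision of the submatrix
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def ci_test(S, n, i, j, cond=(), alpha=0.05):
    """Retain X_i _||_ X_j | X_cond if the Fisher z test fails to reject."""
    r = partial_corr(S, i, j, cond)
    z = 0.5 * np.log((1 + r) / (1 - r))       # Fisher z transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    p_value = 2 * (1 - stats.norm.cdf(stat))
    return p_value > alpha
```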
4.2.4 Search Using Vanishing Tetrad Constraints
Searches using conditional independence constraints are correct, but completely uninformative for some common kinds of data sets. Consider the SEM in Figure 6.
3"! !!
3! ! # ! 3$! ! 3 ! ! %& ! 3%%! ! 3 ! ! %' ! 3%#! !
000000000000001'2"0!
3(! 3)! 3*! !
,-.().*'#)(/! !"#$%&"$"'()*+!
3'! 3%+! ! !
Figure 6. SEM S

The data comes from a survey of test anxiety indicators administered to 335 grade 12 male students in British Columbia [Gierl and Todd, 1996]. The survey contains 20 measures of symptoms of anxiety under test conditions. Each question is about a symptom of anxiety. For example, question 8 is about how often one feels "jittery when taking tests". The answer is observed on a four-point scale (almost never, sometimes, often, or almost always); it will be assumed that the variables are approximately Gaussian. Each X variable represents an answer to a question on the survey. For reasons to be explained later, not all of the questions
on the test have been included in the model. There are three unobserved common causes in the model: Emotionality, Care about achieving (which will henceforth be referred to as Care), and Self-defeating. The test questions are of little interest in themselves; of more interest is what information they reveal about some unobserved psychological traits. If S is correct, there are no conditional independence relations among the X variables alone — the only entailed conditional independencies require conditioning on an unobserved common cause. Hence the FCI algorithm would return a completely unoriented PAG in which every pair of variables in X is adjacent. Such a PAG makes no predictions at all about the effects of manipulations of the observed variables. Furthermore, in this case, the effects of manipulating the observed variables (answers to test questions) are of no interest — the interesting questions are about the effects of manipulating the unobserved variables and the qualitative causal relationships between them. Models such as S are multiple indicator models, and can be divided into two parts: the measurement model, which contains the edges between the unobserved variables and the observed variables (e.g. Emotionality → X2), and the structural model, which contains the edges between the unobserved variables (e.g. Emotionality → Care). The X variables in S ({X2, X3, X5, X7, X8, X9, X10, X11, X14, X16, X18}) were chosen with the idea that they indirectly measure some psychological trait that cannot be directly observed. Ideally, the X variables can be broken into clusters, where each variable in the cluster is caused by exactly one unobserved cause common to the members of the cluster, and a unique error term uncorrelated with every other error term, and nothing else. From the values of the variables in the cluster, it is then easy to make inferences about the value of the unobserved common cause. Such a measurement model is called pure. The measurement model of S is pure. If the measurement model is impure (i.e. there are multiple common causes of a pair of variables in X, or some of the X variables cause each other) then drawing inferences about the values of the common causes is much more difficult. Consider the set X′ = X ∪ {X15}. If X10 caused X15, the measurement model over the expanded set of variables would not be pure. If a measurement model for a set X′ of variables is not pure, it is nevertheless possible that some subset of X′, such as X, has a pure measurement model. If the only reason that the measurement model is impure is that X10 causes X15, then X = X′\{X15} does have a pure measurement model, because all the "impurities" have been removed. S does not contain all of the questions on the survey precisely because various tests described below indicated that some of them needed to be excluded in order to have a pure measurement model. The task of searching for a multiple indicator SEM can then be broken into two parts: first, find clusters of variables so that the measurement model is pure; second, use the pure measurement model to make inferences about the structural model. Factor analysis is often used to determine the number of unmeasured common
causes in a multiple indicator model, but there are important theoretical and practical problems in using factor analysis in this way. Factor analysis constructs models with unobserved common causes (factors) of the observed X variables. However, factor analysis models typically connect each unobserved common cause (factor) to each X variable, so the measurement model is not pure. A major difficulty with giving a causal interpretation to factor analytic models is that the observed distribution does not determine the covariance matrix among the unobserved factors. Hence, a number of different factor analytic models are compatible with the same observed data [Harman, 1976]. In order to reduce the underdetermination of the factor analysis model by the data, it is often assumed that the unobserved factors are independent of each other; however, this is clearly not an appropriate assumption for unobserved factors that are supposed to represent actual causes that may causally interact with each other. In addition, simulation studies indicate that factor analysis is not a reliable tool for estimating the correct number of unobserved common causes [Glymour, 1998]. On this data set, factor analysis indicates that there are 2 unobserved direct common causes, rather than 3 unobserved direct common causes [Bartholomew et al., 2002]. If a pure measurement model is constructed from the factor analytic model by associating each observed variable in X only with the factor that it is most strongly associated with, the resulting SEM fails a statistical test (has a p-value of zero) [Silva et al., 2006]. A search for pure measurement models that depends upon testing vanishing tetrad constraints (described below) is an alternative to factor analysis. Conceptually, the task of building a pure measurement model from the observed variables can be broken into 3 separate tasks:

1. Select a subset of the observed variables that form a pure measurement model.

2. Determine the number of clusters (i.e. the number of unobserved common causes) that the observed variables measure.

3. Cluster the observed variables into the proper groups (so each group has exactly one unobserved direct common cause).

It is possible to construct pure measurement models using vanishing tetrad constraints as a guide [Silva et al., 2006]. A vanishing tetrad constraint holds when Cov(X, Y) · Cov(Z, W) − Cov(X, Z) · Cov(Y, W) = 0. A pure measurement model entails that each variable Xi is independent of every other variable Xj conditional on its unobserved parent, e.g. S entails that X2 is independent of each other Xj conditional on Emotionality. These conditional independence relations cannot be directly tested, because Emotionality is not observed. However, together with the other conditional independence relations involving unobserved variables entailed by S, they imply vanishing tetrad constraints on the observed variables that reveal information about the measurement model that does not
depend upon the structural model among the unobserved common causes. The basic idea extends back to Spearman's attempts to use vanishing tetrad constraints to show that there was a single unobserved factor of intelligence that explained a variety of observed competencies [Spearman, 1904]. Because X2 and X8 have one unobserved direct common cause (Emotionality), and X3 and X5 have a different unobserved direct common cause (Care), SEM S entails Cov_S(X2, X3) · Cov_S(X5, X8) = Cov_S(X2, X5) · Cov_S(X3, X8) ≠ Cov_S(X2, X8) · Cov_S(X3, X5) for all values of the free parameters.22 (This is easy to see using the trek sum rule for calculating covariances.) On the other hand, because X2, X8, X9, and X10 all have one unobserved common cause (Emotionality) as a direct common cause, the following vanishing tetrad constraints are entailed by SEM S: Cov_S(X2, X8) · Cov_S(X9, X10) = Cov_S(X2, X9) · Cov_S(X8, X10) = Cov_S(X2, X10) · Cov_S(X8, X9) [Spirtes et al., 2001]. The BuildPureClusters algorithm uses the vanishing tetrad constraints as a guide to the construction of pure measurement models, and in the large sample limit reliably succeeds if a pure measurement model exists among a subset of the observed variables and each unmeasured common cause has at least three pure indicators [Silva et al., 2006]. Once a pure measurement model has been constructed, it is possible to estimate the covariances among the unobserved common causes. These covariances can then be given as input to an algorithm such as FCI to return a PAG among the unobserved common causes. In this example, the PAG returned contains an undirected edge between every pair of unobserved common causes. (SEM S is an example that is compatible with the PAG, but any other orientation of the edges among the three unobserved common causes that does not create a cycle is also compatible with the pattern.) The resulting SEM (or set of SEMs) passes a statistical test with a p-value of 0.47.
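A vanishing tetrad constraint can be tested from sample data; [Silva et al., 2006] use Wishart's asymptotic test, but a bootstrap sketch conveys the idea with less machinery. The function names below are illustrative.

```python
import numpy as np

def tetrad(data, x, y, z, w):
    """Tetrad difference Cov(X,Y)Cov(Z,W) - Cov(X,Z)Cov(Y,W)."""
    S = np.cov(data, rowvar=False)
    return S[x, y] * S[z, w] - S[x, z] * S[y, w]

def tetrad_vanishes(data, x, y, z, w, reps=2000, alpha=0.05, seed=0):
    """Retain the vanishing tetrad constraint if 0 lies in a bootstrap
    percentile interval for the tetrad difference."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    boots = [tetrad(data[rng.integers(0, n, n)], x, y, z, w)
             for _ in range(reps)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return lo <= 0 <= hi
```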
4.3 Calculating the Effects of Interventions from SEMs with Unobserved Common Causes
Given a SEM without unobserved common causes, there is a simple algorithm for estimating the total effect of a manipulation. First, write the total effect of the manipulation as a function of the free parameters using the path sum rule; second, estimate the values of the free parameters; third, substitute the estimated values of the free parameters into the equation for the effect of the manipulation. This algorithm is applicable to some SEMs with unobserved common causes as well, e.g. SEM S of Figure 6. There are algorithms that can consistently estimate the free parameters of SEM S from just the observed X variables (even though it is not possible to estimate the structural coefficients by simply regressing each variable on its parents, because the parents are unobserved). 22 The inequality is based on an extension of the Causal Faithfulness Assumption that states that vanishing tetrad constraints that are not entailed for all values of the free parameters by the true causal graph are assumed not to hold.
One commonly used method for estimating causal effects when there are unmeasured common causes is through the use of instrumental variables [Staiger and Stock, 1997]. The simplest kind of instrumental variable model has the causal graph shown in Figure 7, where GAS PRICE is the instrument. No regression among the observed variables yields an unbiased estimate of the linear coefficient relating OPERATING COSTS to MILES (because of the unobserved common cause T). Moreover, the structural coefficients relating T to MILES and OPERATING COSTS are not identifiable from the observed covariance matrix. However, the structural coefficient relating OPERATING COSTS to MILES is equal to Cov(GAS PRICE, MILES)/Cov(GAS PRICE, OPERATING COSTS), and hence is still identifiable.
Figure 7. Instrumental Variable Model (GAS PRICE → OPERATING COSTS → MILES, with T an unobserved common cause of OPERATING COSTS and MILES)

In order to perform instrumental variable estimation, it is necessary to find some variable X to serve as an instrument; ideally, for accurate estimation, X is strongly correlated with the cause, and uncorrelated with the error term of the effect [Staiger and Stock, 1997]. Unfortunately, in simple SEMs such as Figure 7 it is not possible to test whether X is an instrument (e.g. that X does not directly cause Z); this must be concluded from background assumptions about the domain. Recent algorithms have generalized instrumental variable estimation to a wider class of models. Another approach to estimating the effects of manipulations in some models with unobserved common causes is based on a causal calculus that takes as input a causal graph and the marginal probability distribution over the observed variables [Spirtes et al., 2001; Pearl, 2000]. The causal calculus has been incorporated into algorithms that search for an expression that relates the effect of a manipulation directly to the marginal probability distribution over the observed variables (instead of relating the manipulation to a function of the free parameters). The recently constructed algorithms have been shown to be complete (they either return an answer or say "can't be estimated") [Pearl, 1995; Huang and Valtorta, 2006; Shpitser and Pearl, 2006a; Shpitser and Pearl, 2006b]. However, these algorithms require knowing the true SEM, which in general cannot be determined by the observed marginal distribution alone. In cases where a SEM is linear, but at most one of the errors is Gaussian, recently developed algorithms calculate the effect of one manipulated observed variable on other observed variables from the observed marginal distribution [Hoyer et al.,
2006; Hoyer et al., in press].
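The instrumental-variable ratio described above is a one-line calculation once the covariances are estimated. The simulation below is an arbitrary illustration of Figure 7's structure, with T the unobserved common cause; the true coefficient of OPERATING COSTS on MILES is set to −2.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
gas_price = rng.normal(size=n)                        # the instrument
T = rng.normal(size=n)                                # unobserved confounder
op_costs = 1.5 * gas_price + T + rng.normal(size=n)
miles = -2.0 * op_costs + 3.0 * T + rng.normal(size=n)

# Naive regression is biased by T; the instrumental-variable ratio is not.
naive = np.cov(op_costs, miles)[0, 1] / np.var(op_costs)
iv = np.cov(gas_price, miles)[0, 1] / np.cov(gas_price, op_costs)[0, 1]
print(naive, iv)   # naive is pulled away from -2.0; iv is close to it
```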
4.3.1 Calculating the Effects of Manipulation from a Markov Equivalence-over-O Class of SEMs with Unobserved Common Causes
Suppose that instead of knowing the true SEM, it is known that the true SEM lies in a Markov equivalence-over-O class represented by a PAG (which is the case unless background assumptions are strong), e.g. the PAG for SEM G. It is not known whether the true SEM is SEM B, SEM C, SEM G, or some other SEM in the Markov equivalence-over-O class that has unobserved common causes. There is an algorithm that takes a PAG as input and, in some cases where all of the SEMs that are represented by the PAG agree on the effects of a particular manipulation, returns a formula for the effect of the manipulation in terms of the marginal distribution over O, and otherwise returns "Can't tell". Because a Markov equivalence-over-O class is a proper superset of a Markov equivalence class over SEMs without unobserved common causes, stronger inferences can be drawn when the assumption of no unobserved common causes is made. For example, because the only SEMs without unobserved common causes in the Markov equivalence-over-O class are SEMs C and G, if it is known that there are no unobserved common causes then it is known that X causes Y. In contrast, if there is a possibility that there are unobserved common causes, then the true SEM may have an unobserved common cause of X and Y. So given the assumption of no unobserved common causes and hence the edge X → Y, it is possible to calculate the effect of X on Y; but given the PAG (i.e. allowing the possibility of unobserved common causes and the edge X → Y) it is not possible to calculate the effect of X on Y. However, even allowing for the possibility of unobserved common causes, it is possible to calculate that the effect of Y on X is zero from the PAG (from the arrowhead at X → Y). There are also examples where nonzero effects can be calculated from PAGs. For example, it is possible to calculate the effect of parental encouragement on college plans from the PAG of Figure 5. Introducing the possibility of unobserved common causes reduces the number of manipulated quantities that can be calculated, and increases the complexity of calculating the manipulated quantities that can still be calculated.
4.4 Summary
In summary, the following steps constitute a reliable method of estimating the effects of manipulations from sample data and background assumptions (e.g. acyclicity), even when there may be unobserved common causes.

1. Perform a search to find the best SEM or set of SEMs. For some parametric families (e.g. linear Gaussian or discrete) the best that can be done is to find the best Markov equivalence-over-O class of SEMs. For other parametric families (e.g. linear but non-Gaussian error terms) it is possible to find a single SEM.
2. Determine if all of the SEMs in the best set of SEMs agree on the effect of a manipulation. (If this is a single SEM, this is trivial. If it is a Markov equivalence class of SEMs, there are algorithms for doing this. For intermediate cases, further research is needed.) If they do agree, output a formula that relates the effect of the manipulation to the marginal distribution over the observed variables entailed by the SEMs in the best class. If they don't, output "Can't tell" and halt.

3. For one of the SEMs in the best set of SEMs, estimate the entailed marginal density over the observed variables.

4. Substitute the estimated marginal density over the observed variables into the formula relating the effect of the manipulation to the entailed observed marginal density of the model.23

23 This works for simple cases, but in some cases the actual algorithm is slightly more complicated.
5 OPEN PROBLEMS24

24 This section benefited greatly from suggestions made by Constantin Aliferis.
The questions listed below concern areas of active research that have produced some answers, but many open questions remain. The citations given are just a sample from larger bodies of research.
5.1 Models
There are a wide variety of causal models that have been employed in different disciplines. These include Bayesian Networks, Chain Graphs, Partial Ancestral Graphs, Markov Decision Processes, Structural Equation Models, Propensity Scoring, Information Theory, and Granger Causality. The relative advantages and disadvantages of these models, and the relationships between them, are only partially understood. What new models are appropriate for different domains, e.g. feedback or reversible systems [Spirtes et al., 1993; Spirtes et al., 2001]? What new models are appropriate for different combinations of kinds of data, e.g. experimental and observational [Eberhardt and Glymour, 2006; Eberhardt et al., 2005; Yoo and Cooper, 2004; Yoo et al., 2006; Danks, 2002; Cooper and Yoo, 1999]? What new models are appropriate for different kinds of background assumptions, and different families of densities?
5.2 Model Scores
What kind of scores can be used to best evaluate causal models from various kinds of data? While some scores, such as BIC, have good large sample properties, they are difficult to compute or cannot be applied to some causal models, and may not
have good small sample properties. In a related vein, what are good families of prior densities that capture various kinds of background assumptions?
5.3 Search Algorithms
How can search algorithms be improved to incorporate different kinds of background assumptions, search over different classes of causal models, run faster, handle more variables and larger sample sizes, be more reliable at small sample sizes, and produce output that is as informative as possible?
5.4 Properties of Search Algorithms
For existing and novel causal search algorithms, what are their semantic and syntactic properties (e.g. soundness, consistency, maximum informativeness)? What are their statistical properties (pointwise consistency, uniform consistency, sample efficiency)? What are their computational properties (computational complexity)?
5.5 Assumptions
What plausible alternatives are there to the Causal Markov and Faithfulness Assumptions? Are there other assumptions that are weaker and hold in more domains and applications without much loss in what can be reliably inferred? Are there stronger assumptions that are plausible for some domains that might allow for stronger causal inferences? How often are these assumptions violated, and how much do violations of these assumptions lead to incorrect inferences? There are special assumptions, such as linearity, which can improve the strength of causal conclusions that can be reliably inferred, and the speed and sample efficiency of algorithms that draw the conclusions. What other density families or stronger domain assumptions are plausible for some domains, and how can they be used to improve causal inference? Can various statistical assumptions be relaxed? For example, what if the sample selection process is not i.i.d., but may be causally affected by variables of interest [Richardson and Spirtes, 2002; Spirtes et al., 1995; Cooper, 1995; Cox and Wermuth, 1996; Cooper, 2000]?
5.6 Deriving Consequences of Causal Models
Shpitser and Pearl have given complete algorithms for deriving the consequences of various causal models with hidden common causes in terms of the unmanipulated density and the given manipulation [Shpitser and Pearl, forthcoming]. Partial extensions of these results to deriving consequences from sets of causal models have been given [Zhang, 2008]; are there further extensions to derivations from sets of causal models? It is often useful to quickly derive constraints (e.g. vanishing tetrad constraints) on marginal densities from causal models with hidden common causes, in order to
guide search. Are there other constraints on densities that can be derived from causal models, and how can they be incorporated into search algorithms?
5.7 Evaluation
What are the most appropriate performance measures for causal inference algorithms? What benchmarks can be established? What is the best research design for testing causal inference algorithms?
5.8 Interconnections
Many different domains have studied causal discovery, including Artificial Intelligence, Econometrics, Markov Decision Processes, Operations Research, Control Theory, Experimental Design, and Statistics. What are the formal connections between the different models, assumptions, and algorithms used in each of these domains? What can each of these domains learn from the others?
6 APPENDIX
This appendix contains some definitions of the properties of estimators that are mentioned in the text. An estimator φ_n of a model parameter µ (such as the mean of a variable) is a function from samples O_n of size n to a real number - that is, for each sample, the estimator outputs an estimate of the quantity µ. The quality of an estimator φ_n of µ can be measured by its mean squared error: that is, the mean (over all randomly independently selected samples of size n) of (φ_n(O_n) − µ)², the square of the difference between the real number output by the estimator and µ. Let φ̄_n be the average output of φ_n(O_n) (with respect to the sampling distribution of the population density). The mean squared error is the sum of two terms: the square of the bias, and the variance of the estimator. The bias of the estimator is (φ̄_n − µ), i.e. the difference between the mean output of the estimator and the true value µ. The variance of the estimator is the mean of (φ_n(O_n) − φ̄_n)², i.e. the mean of the squared difference between the output of the estimator and the average output of the estimator. A pointwise consistent estimator is one in which the mean squared error approaches zero as the sample size approaches infinity. The quality of an estimator at a finite sample size depends upon both the bias and the variance of the estimator. (There are other desirable properties that estimators can have, such as being uniformly consistent — roughly, that it is possible to put probabilistic bounds on the size of the error at a given sample size — which will not be discussed here; see [Robins et al., 2003].) One kind of estimator that is commonly employed is a maximum likelihood estimator, which under mild regularity conditions has a number of desirable properties such as pointwise consistency. Suppose that SEM C is given and the goal is to estimate the values of the free
parameters of SEM C, there is an implied covariance matrix. Given the implied covariance matrix, it is possible to determine the density of drawing a sample that has the observed sample covariance matrix; this is the likelihood of the data for θ. A maximum likelihood estimator selects the assignment θ of values to the free parameters that makes the sample data most likely. In the case of a SEM over a causally sufficient set of variables, the maximum likelihood estimate of the linear coefficient b_YZ (denoted by b̂_YZ) is the regression coefficient of Z when Y is regressed on its non-descendants in the causal graph (e.g. in the case of SEM C, the regression coefficient of Z when Y is regressed on Z, X, and W). In SEM C, the formula for the total effect of W on Y is m · b_YZ · b_ZW. Substituting the maximum likelihood estimates b̂_YZ and b̂_ZW into the formula yields m · b̂_YZ · b̂_ZW, which is a maximum likelihood estimate of the total effect of W on Y.
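For concreteness, the following Python sketch carries out this recipe on the W → Z → Y fragment of SEM C (the coefficient values are arbitrary illustrations, and the manipulation factor m is omitted): each structural coefficient is estimated by regressing a variable on its non-descendants, and the estimated total effect of W on Y is the product of the estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
W = rng.normal(size=n)
Z = 0.8 * W + rng.normal(size=n)   # b_ZW = 0.8
Y = 0.5 * Z + rng.normal(size=n)   # b_YZ = 0.5

def coef(y, regressors):
    """OLS coefficients of y on the given regressors."""
    X = np.column_stack(regressors)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_ZW_hat = coef(Z, [W])[0]
b_YZ_hat = coef(Y, [Z, W])[0]      # Y regressed on its non-descendants
print(b_YZ_hat * b_ZW_hat)         # close to 0.4, the total effect of W on Y
```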
ACKNOWLEDGEMENTS

I would like to thank an anonymous reviewer for many helpful comments, and Constantin Aliferis for detailed suggestions about the description of open problems.
BIBLIOGRAPHY

[Ali et al., 2005] R. A. Ali, T. S. Richardson, P. Spirtes, and J. Zhang. Towards Characterizing Markov Equivalence Classes for Directed Acyclic Graphs with Latent Variables. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005.
[Bartholomew and Knott, 1999] D. J. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis (Kendall's Library of Statistics, 7). A Hodder Arnold Publication, 1999.
[Bartholomew et al., 2002] D. J. Bartholomew, F. Steele, I. Moustaki, and J. I. Galbraith. The Analysis and Interpretation of Multivariate Data for Social Scientists (Texts in Statistical Science Series). Chapman and Hall/CRC, 2002.
[Bollen, 1989] K. A. Bollen. Structural Equations with Latent Variables. Wiley-Interscience, 1989.
[Cartwright, 1994] N. Cartwright. Nature's Capacities and Their Measurements (Clarendon Paperbacks). Oxford University Press, USA, 1994.
[Chickering, 2003] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507-554, 2003.
[Cooper, 1995] G. F. Cooper. Causal Discovery from data in the presence of selection bias. Paper presented at the Fifth International Workshop on AI and Statistics, 1995.
[Cooper, 2000] G. F. Cooper. Causal Modeling and Discovery Under Selection. Paper presented at the Sixteenth Uncertainty In Artificial Intelligence Conference, San Francisco, CA, 2000.
[Cooper and Yoo, 1999] G. F. Cooper and C. Yoo. Causal Discovery from a Mixture of Data. Paper presented at the Fifteenth Uncertainty In Artificial Intelligence Conference, San Francisco, CA, 1999.
[Cox and Wermuth, 1996] D. R. Cox and N. Wermuth. Multivariate Dependencies: Models, Analysis and Interpretation (Monographs on Statistics and Applied Probability). Chapman and Hall/CRC, 1996.
[Danks, 2002] D. Danks. Learning the causal structure of overlapping variable sets. Lect Notes Comput Sc, 2534, 178-191, 2002.
[Eberhardt, 2009] F. Eberhardt and C. Glymour. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables. Innovations in Machine Learning: Theory And Applications, 2009.
[Eberhardt et al., 2005] F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. Paper presented at the 21st Conference on Uncertainty in Artificial Intelligence, 2005.
[Fisher, 1970] F. Fisher. A Correspondence Principle for Simultaneous Equation Models. Econometrica, 38, 73-92, 1970.
[Geiger et al., 2002] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified Exponential Families: Graphical Models and Model Selection. Ann. Statist, 30, 1412-1440, 2002.
[Geiger and Meek, 1999] D. Geiger and C. Meek. Quantifier Elimination for Statistical Problems. Paper presented at the 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999.
[Gierl and Todd, 1996] M. J. Gierl and R. W. Todd. A Confirmatory Factor Analysis of the Test Anxiety Inventory Using Canadian High School Students. Educational and Psychological Measurement, 56(2), 315-324, 1996.
[Glymour, 1998] C. Glymour. What Went Wrong? Reflections on Science by Observation and the Bell Curve. Philosophy of Science, 65(1), 1-32, 1998.
[Glymour, 1999] C. Glymour. Rabbit Hunting. Synthese: An International Journal for Epistemology, Methodology and Philosophy of Science, 121, 55-78, 1999.
[Glymour, 1980] C. Glymour. Theory and Evidence. Princeton, NJ: Princeton University Press, 1980.
[Harman, 1976] H. H. Harman. Modern Factor Analysis. University of Chicago Press, 1976.
[Haughton, 1988] D. M. A. Haughton. On the Choice of a Model to Fit Data from an Exponential Family. The Annals of Statistics, 16(1), 342-355, 1988.
[Hoover, 2001] K. D. Hoover. Causality in Macroeconomics. Cambridge University Press, 2001.
[Hoyer et al., 2006] P. O. Hoyer, S. Shimizu, and A. Kerminen. Estimation of linear, non-gaussian causal models in the presence of confounding latent variables. Paper presented at the Third European Workshop on Probabilistic Graphical Models, 2006.
[Hoyer et al., in press] P. O. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-gaussian causal models with hidden variables. International Journal of Approximate Reasoning, in press.
[Huang and Valtorta, 2006] Y. Huang and M. Valtorta. Identifiability in Causal Bayesian Networks: A Sound and Complete Algorithm. Paper presented at the Twenty-First National Conference on Artificial Intelligence, 2006.
[Lauritzen, 2001] S. L. Lauritzen. Causal Inference from Graphical Models. In O. E. Barndorff-Nielsen, D. R. Cox and C. Klüppelberg (Eds.), Complex Stochastic Systems (pp. 63-107). London/Baton Rouge: Chapman and Hall, 2001.
[Lauritzen et al., 1990] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H. G. Leimer. Independence properties of directed Markov fields. Networks, 20, 491-505, 1990.
[Pearl, 1995] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4), 669-688, 1995.
[Pearl, 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[Pearl, 2000] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[Ramsey et al., 2006] J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-Faithfulness and Conservative Causal Inference. Paper presented at the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.
[Richardson and Spirtes, 2002] T. Richardson and P. Spirtes. Ancestral graph Markov models. Ann Stat, 30(4), 962-1030, 2002.
[Robins, 1986] J. M. Robins. A new approach to causal inference in mortality studies with sustained exposure periods - application to the healthy worker survivor effect. Mathematical Modelling, 7, 1393-1512, 1986.
[Robins et al., 2003] J. M. Robins, R. Scheines, P. Spirtes, and L. Wasserman. Uniform consistency in causal inference. Biometrika, 90(3), 491-515, 2003.
[Rodgers and Maranto, 1989] R. C. Rodgers and C. L. Maranto. Causal-Models Of Publishing Productivity In Psychology. J Appl Psychol, 74(4), 636-649, 1989.
[Rubin, 1974] D. B. Rubin. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66(5), 688-701, 1974.
[Sewell and Shah, 1968] W. H. Sewell and V. P. Shah. Social Class, Parental Encouragement, and Educational Aspirations. Am J Sociol, 73(5), 559-572, 1968.
[Shimizu et al., 2006] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7, 2003-2030, 2006.
[Shpitser and Pearl, 2006a] I. Shpitser and J. Pearl. Identification of Conditional Intervention Distributions. Paper presented at the 22nd Annual Conference on Uncertainty in Artificial Intelligence, Arlington, VA, 2006.
[Shpitser and Pearl, 2006b] I. Shpitser and J. Pearl. Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models. Paper presented at the Twenty-First National Conference on Artificial Intelligence, Menlo Park, California, 2006.
[Shpitser and Pearl, forthcoming] I. Shpitser and J. Pearl. Complete Identification Methods for the Causal Hierarchy. Journal of Machine Learning Research, forthcoming.
[Silva et al., 2006] R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. J Mach Learn Res, 7, 191-246, 2006.
[Simon, 1985] H. A. Simon. Spurious Correlation: A Causal Interpretation. In H. M. Blalock (Ed.), Causal Models in the Social Sciences (pp. 7-22). MacMillan, 1985.
[Spearman, 1904] C. Spearman. General Intelligence objectively determined and measured. American Journal of Psychology, 15, 201-293, 1904.
[Spirtes et al., 1995] P. Spirtes, C. Meek, and T. S. Richardson. Causal inference in the presence of latent variables and selection bias. Paper presented at the Eleventh Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, 1995.
[Spirtes et al., 1998a] P. Spirtes, T. Richardson, C. Meek, R. Scheines, and C. Glymour. Using path diagrams as a structural equation modeling tool. Sociol Method Res, 27(2), 182-225, 1998.
[Spirtes and Richardson, 1996] P. Spirtes and T. S. Richardson. A Polynomial Time Algorithm For Determining DAG Equivalence in the Presence of Latent Variables and Selection Bias. Paper presented at the Proceedings of the 6th International Workshop on Artificial Intelligence and Statistics, 1996.
[Spirtes et al., 1998b] P. Spirtes, T. S. Richardson, C. Meek, and R. Scheines. Using path diagrams as a structural equation modeling tool. Sociological Methods and Research, 27, 182-225, 1998.
[Spirtes et al., 1990] P. Spirtes, R. Scheines, and C. Glymour. Simulation Studies of the Reliability of Computer-Aided Model Specification Using the TETRAD II, EQS and LISREL VI Programs. Sociological Methods and Research, 3-66, 1990.
[Spirtes et al., 1993] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag Lecture Notes in Statistics, 1993.
[Spirtes et al., 2001] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, Second Edition (Adaptive Computation and Machine Learning). The MIT Press, 2001.
[Staiger and Stock, 1997] D. Staiger and J. H. Stock. Instrumental variables regression with weak instruments. Econometrica, 65(3), 557-586, 1997.
[Strotz and Wold, 1960] R. H. Strotz and H. O. A. Wold. Recursive vs. Nonrecursive Systems: An Attempt at Synthesis. Econometrica, 28(2), 417-427, 1960.
[Wyatt, 2004] G. Wyatt. Macroeconomic Models in a Causal Framework. Exempla Books, 2004.
[Yoo and Cooper, 2004] C. Yoo and G. F. Cooper. An evaluation of a system that recommends microarray experiments to perform to discover gene-regulation pathways. Artif Intell Med, 31(2), 169-182, 2004.
[Yoo et al., 2006] C. Yoo, G. F. Cooper, and M. Schmidt. A control study to evaluate a computer-based microarray experiment design recommendation system for gene-regulation pathways discovery. J Biomed Inform, 39(2), 126-146, 2006.
[Zhang, 2007] J. Zhang. A Characterization of Markov Equivalence Classes for Directed Acyclic Graphs with Latent Variables. Paper presented at the 23rd Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, 2007.
[Zhang, 2008] J. Zhang. Causal Reasoning with Ancestral Graphs. Journal of Machine Learning Research, 9, 1437-1474, 2008.
THE LOGIC AND PHILOSOPHY OF CAUSAL INFERENCE: A STATISTICAL PERSPECTIVE

Sander Greenland

The topic of causality is a vast one in science and philosophy, so vast that even a limited review would require a book. Yet most theories have not found favor among empirical researchers – by whom I mean those whose primary job is to collect and analyze data, as opposed to philosophers or theoreticians. This chapter will thus concern only the few statistical theories for causation and causal inference that have produced methods now widespread in practice. In an attempt to avoid the confusion that often accompanies narrative descriptions of causation and causal inference (especially in applied sciences), this chapter uses rather stark and purely logical descriptions, and will assume the reader has at least some familiarity with probability and statistics. Detailed illustrations and applications, along with philosophical discussions, can be found in the references.

Special emphasis will be given to issues of causal inference from uncontrolled observations (observational studies), in which the effect under study becomes difficult to separate from other, extraneous phenomena; the latter are often called “bias sources” or “systematic errors”. Before embarking on these descriptions, I will touch briefly on the relevance (or possible lack thereof) for statistics of philosophies of causation.

DO WE NEED PHILOSOPHY OF CAUSATION FOR A STATISTICAL THEORY OF CAUSAL INFERENCE?

It is possible to distinguish two kinds of inference: Inference to causal models from observations, and inference from causal models to the effects of manipulations. Inference to causal models may be viewed as trying to construct a general set of laws from existing observations that can be tested with and applied to new observations. In statistics this problem is subsumed under the topic of model specification or model building. Inference from causal models may be viewed as deducing tests and making decisions based on proposed or accepted laws, which in statistics is subsumed under topics of testing, estimation, and decision theory. In applied statistics, the feedback between these two directions of inference is often summarized as a cycle of model proposal → model test → model revision → model test that continues until available tests cease to have practical impact on the model [Box, 1980]. There are familiar controversies about whether cycles of this form lead toward “truth” or simply toward effective tools for prediction and
manipulation (e.g., [Kuhn, 1970a; 1970b]), and whether the philosophical debate surrounding causal inference stems from the fact that the word “causation” evokes some notion of a deeper truth about the world hidden from current view. Of interest then is that the most successful statistical model of causation, the potential-outcomes model discussed below, has attracted theoretical criticisms precisely because it contains counterfactual elements hidden from randomized-experimental test (e.g., [Dawid, 2000]). These criticisms have been dismissed by applied statisticians (see the discussion following [Dawid, 2000]), who understand that the manipulative account inherent in potential-outcomes models fits well with a more instrumentalist or predictive view of causation than critics admit. Indeed, these models can be and have been used to great success with no worry about whether their hidden elements need to be taken seriously [Greenland, 2004], just as the celestial cogs and wheels once used to display the Ptolemaic model of celestial motions were no obstacle to its considerable predictive success.

Given this instrumentalist view, it might seem that causal inference may be distinguished from other inferences only due to its emphasis on manipulation rather than prediction. From a statistical viewpoint, the distinction between prediction and causal inference is semantic, not philosophical: Causal inference is merely a special case of prediction in which we are concerned with predicting outcomes under alternative manipulations. Because only one of the alternatives can be carried out, only one of the outcomes can be observed, resulting in nonidentification. But the solution to this problem is no different than in problems of pure prediction: We simply assume some limited form of isotropy, in which predictive regularities (whether labeled “predictive” or “causal”) persist over the space and time spans of interest, at least enough to justify generalizations across the spans. Whether a deeper analysis is warranted for practice remains to be seen.

POTENTIAL OUTCOMES AND STRUCTURAL EQUATIONS

Manipulative accounts of causation, including those with counterfactuals, have deep roots in the history of modern science. Informal outlines for causal inference may be traced as far back as the development of experimental science. After all, typical definitions of “experiment” include an element of experimenter “control” of conditions, implying that such control will affect the outcome. Early in these developments, however, Hume [1739; 1748, p. 115] recognized that the definition of “cause” implicit in much usage carried the seeds of intractable underdetermination or, as known in statistics, nonidentification; that is, observation alone could not determine or identify whether one condition caused another.

To formalize the notion of causation and delineate the identification problem, consider the following model (introduced by Neyman [1923]), which became established in the experimental literature in the mid-20th century, was later extended and popularized for observational research [Rubin, 1990], and is now standard in much of statistics, where it is called the potential-outcomes or “counterfactual” model. Suppose we observe a subject i to have a particular outcome after being
given a treatment. (“Subject” is here merely a term for observational unit; it may be a plot of land, a laboratory animal, or a population, as opposed to a person.) Let X be the variable ranging over the treatment possibilities, and let x and x* be two distinct treatments (that is, values for X with x* ≠ x). Let Y be the variable ranging over the outcome (response) possibilities, and let y and y* be two distinct outcomes (that is, values for Y with y* ≠ y). As an example, X might encode a range of treatment options for women with perimenopausal complaints, such as unopposed estrogen therapy, opposed estrogen therapy, placebo treatment, and no treatment, while Y could indicate survival over the decade following treatment initiation (Y = 1 if the woman survives, 0 if not), or Y could be the survival time (lifespan) following treatment initiation.

A common notion of cause and effect is then captured as follows: Receiving treatment x (i.e., having X = x) caused the outcome Y to be y for the subject relative to having instead X = x*, if the actual outcome and treatment were y and x, but would have been y* had x* been administered instead. The two outcomes y and y* are then called the potential outcomes corresponding to treatments x and x* for the subject, and the difference y − y* is called a measure of the effect on the subject of giving X = x instead of (or relative to) X = x*. In the example, y − y* could be the difference of survival time with unopposed estrogen therapy (x) versus placebo (x*).

The nonidentification problem reflects that the subject’s response y* to treatment x* is not observed if the received treatment x is not equal to x*, and therefore the effect y − y* cannot be computed from the observations. For example, we cannot observe how long a woman would survive under placebo therapy if in reality she receives unopposed estrogen, and so we cannot compute the effect that receiving unopposed estrogen rather than placebo had on this woman. This problem in causal statements is often highlighted by noting that the premise of the conditional “If X had been x* instead of x, Y would have been y*” is counterfactual (contrary to fact) [Lewis, 1973a; 1973b]. Any statistical inference about the effect must therefore invoke physical assumptions that give precise meaning to the counterfactual conditional. It must also invoke “identification” assumptions that allow construction of estimates and tests of the average or expected effect E(Y − Y*) from the data actually observed.

For further details and citations on the model see [Greenland et al., 1999; Greenland, 2004], or some of the many other reviews (e.g., [Morgan and Winship, 2007; Pearl, 2009]). The reader is warned however that (notwithstanding the enormous contributions by Rubin to the model) much of the sociologic literature misattributes the model to Rubin, some going so far as to call the model the “Rubin Causal Model” (e.g., [Holland, 1986]), and thus misses large segments of the literature on the model in the experimental, econometric, and health sciences.
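To make the nonidentification problem concrete, the following simulation sketch (Python; the distributions, effect size, and assignment mechanism are all invented for illustration) constructs both potential outcomes for every subject, reveals only one, and shows how a confounded treatment–outcome association can differ sharply from the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Both potential outcomes exist in the simulation, though never in real data:
# y0 = outcome under treatment x*, y1 = outcome under treatment x.
y0 = rng.normal(10.0, 2.0, n)
y1 = y0 + 1.0                        # true effect y - y* is 1.0 for every subject

# Confounded assignment: subjects with larger y0 are more likely to receive x,
# so treatment is associated with the potential outcomes across subjects.
treated = (rng.normal(0.0, 1.0, n) + (y0 - 10.0)) > 0

observed = np.where(treated, y1, y0)   # only one potential outcome is ever seen

print(np.mean(y1 - y0))                                      # 1.0 (unobservable)
print(observed[treated].mean() - observed[~treated].mean())  # about 3.9 here
```

The gap between the two printed numbers is exactly the gap between association across subjects and effect within subjects that the text describes.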
Causal Laws and Structural Equations

Basic science often provides an empirical pattern or a more formal physical theory that predicts the subject’s outcome Y as a function of the treatment X. This function might be subject-specific, tailored to specifics of the subject’s characteristics. For example, suppose the subject is an ordinary ceramic dish, the treatment is dropping it flat on a concrete floor from height x, and the outcome is breakage (Y = 1) or not (Y = 0). Ordinary experience provides us a rough theory that says dropping the dish two meters will cause breakage whereas dropping it one millimeter will not; that is, X = 2000mm will cause Y = 1 relative to X = 1mm. A different theory would apply to a steel dish.

Especially in physics, such experience may eventually give rise to a mathematical “law” or model f(x) relating Y to X for each of a broad class of subjects. Examples include laws governing behavior of charged particles in response to an electric field of a given strength, or more limited laws governing the size of predator populations in response to prey abundance. Such theoretical mechanisms or laws shift the uncertainty about the counterfactuals (which are unobserved potential outcomes) to uncertainty about the mechanisms or laws connecting the outcome variable Y to the antecedent variable X. Indeed, that shift is often promoted as a major force for progress in experimental science, as follows: Suppose one proposes a general physical theory that says or implies that Y = f(x) whenever X = x for each subject in a given class. Observing these predictions fail — that is, observing enough subjects in the class who have Y ≠ f(x) — can then be grounds for discarding or modifying the theory.

A functional relation Y = fi(x) supplying the outcome of subject i under different possible treatments is often called a structural equation for the subject. It is important to note that the equation gives the variation in the outcome Y as X varies within a single subject; that is, it shows how Y varies as X is varied while i is held constant. This single-subject feature captures the counterfactual nature of causal laws. Although it is not the defining feature of a structural equation, the within-subject property distinguishes the equation from ordinary “regression” functions of statistics, which describe associations. An association is the variation in Y as one moves across subjects with different X; that is, both X and i are varying in a regression. Nonetheless, because analytic statistical methodology is heavily invested in estimating associations, and because associations are all that are statistically identified in nonexperimental settings, much of statistical theory for causal inference comprises delineation of assumptions that allow deduction of regression equations from structural equations [Berk, 2004; Pearl, 2009].
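As a toy sketch of a subject-specific structural equation (Python; the 1500 mm breaking threshold is an invented parameter, not something given in the text), the dish example can be written as an explicit function fi(x):

```python
def f_dish(threshold_mm: float, x_mm: float) -> int:
    """Structural equation f_i(x) for dish i: breakage (Y = 1) or not (Y = 0).

    threshold_mm plays the role of the subject index i: it fixes the
    subject-specific law relating drop height x to the outcome Y."""
    return 1 if x_mm >= threshold_mm else 0

ceramic = 1500.0                 # invented breaking threshold for one ceramic dish
print(f_dish(ceramic, 1.0))      # Y = 0: a 1 mm drop does not break this dish
print(f_dish(ceramic, 2000.0))   # Y = 1: a 2000 mm drop breaks the same dish
```

Both calls hold the subject fixed while varying X, which is exactly the within-subject contrast that distinguishes a structural equation from a regression across subjects.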
Causal Null Hypotheses

A large portion of standard statistical theory concerns testing of associational “null hypotheses,” which assert absence of association in a population or distribution from which the observations are supposed to have arisen. In causal inference these
nulls become no-effect hypotheses or “causal nulls,” which in their strong or strict form state that the structural equation is constant for each subject; that is, fi(x) = ci for each subject i. This hypothesis need not correspond to the hypothesis of no association, which asserts that Y does not change across different subjects with different X values. But much of statistical theory for causal inference comprises delineation of assumptions that allow deduction of associational null hypotheses from causal null hypotheses. When such an associational null is identifiable, its rejection by a statistical test implies that the causal null hypothesis should be rejected as well.
The Causal Identification Problem

As just described, there is one element that makes a theory causal and which demands more than mere observation for inference: The multiple possible values for Y for each subject i, given different values of X. Let xi be the observed value of X for subject i, and suppose we have a theory that predicts Y as a function fi(xi) of X. No matter how many subjects yield the predicted outcome fi(xi) upon passive observation, and so are in accord with the causal theory that Y = fi(x), that theory cannot be deemed more than a good description of how Y and X will be associated as one looks across subjects. In terms of passive observation, it merely predicts how plots of the pairs (xi, yi) will look as i varies across subjects. The theory’s predictions may be borne out among the treatment assignments (X distribution) we observe, but we cannot be sure the theory would have succeeded under other, counterfactual treatment assignments. In algebraic terms, we might see Yi = fi(xi) for all the subjects, even though Yi might not have equaled fi(x*) for some x* ≠ xi. In sum, a theory may be a very good description of what we observe but a poor predictor of what we would observe upon intervening to change treatment assignment X. In other words, no matter how well it predicts associations across subjects, it need not tell us how the outcome Y would change upon changing X within subjects.
Confounding and Randomization

Whether by the investigator, nature, or another party, the treatment xi given subject i might have been determined in a way that is associated with Y across subjects, apart from any causal (within-subject) link from X to Y. The term confounding is often used to refer to this condition [Greenland et al., 1999], although it is also known as “nonignorability of the treatment-assignment mechanism” [Rubin, 1991]. Another way to describe this condition is that there is between-subject variation in Y that is not due to within-subject variation of Y with X; in other words, given a fixed value x for X, it is covariation of the potential outcome Yx with X across subjects. Earlier, informal discussions of causal inference described this condition as “extraneous variation,” or that portion of between-subject association of X and Y that is not due to an effect of X on Y. They recognized that such
covariation of X and Y would remain present even if the causal null hypothesis were correct, and thus would distort causal inferences or tests of that hypothesis [Mill, 1843]. Sources of extraneous variation are sometimes called “confounders,” although the term confounder is often defined in more strict terms ([Greenland and Pearl, 2007]; see the section on causal diagrams below).

Recognizing the impossibility of eliminating or even knowing all confounders in biological work, R. A. Fisher [1932; 1935] developed an elegant theory of randomized experiments to allow statistical “control for” confounding. This control is accomplished by enforcing a known distribution for confounding under sufficiently strict causal hypotheses (such as the null hypothesis). In its basic form, randomization theory replaces vague ignorance about the degree of confounding with a fully specified probability distribution for the observed outcomes under the sharp null hypothesis of no effect Yi = fi(x) = ci (Y constant across X within i).

Consider classical permutation inference (e.g., [Cox and Hinkley, 1974, Ch. 6]) in an experiment that assigns values of X among N subjects, and let R be the treatment-assignment variable. R ranges over possible values for X; thus, when subject i is assigned to have X = x, the subject’s value ri for R is equal to x. Note however that the subject may deviate from assigned treatment, resulting in the actual value xa of X not being equal to the assigned value x (in which case X ≠ R).

If assignment R has no effect on Y, the observed outcomes y1, . . ., yN should be the same regardless of the assignments r1, . . ., rN; in other words, they should be the ci in the null model Yi = fi(x) = ci. Thus, given this causal null hypothesis, we may regard the outcome list (y1, . . ., yN) as if it were fixed from the start of the experiment and thus independent of a subsequent random treatment allocation. A known randomization scheme then allows one to compute exact probabilities for any allocation of these fixed outcomes among the treatment levels, including the allocation observed. From these probabilities, we can compute null distributions of test statistics (such as the sample sum of cross products Σi xiyi used for trend tests) and thus compute P-values (“observed significance levels”) for testing the causal null hypothesis. If the possible treatment allocations are merely permutations of the actual treatment-assignment list r1, . . ., rN, the resulting test is known as a permutation test for the effect. Tests of means may also be used, noting that the total-sample mean of Y remains fixed under the null hypothesis, regardless of allocation.

As a special case of the foregoing methods, consider binary X and Y with perfect random allocation of N1 subjects to X = 1 and N0 subjects to X = 0, and let N = N1 + N0. Then Σi xiyi is the number of subjects in the X = 1, Y = 1 cell of the 2 by 2 table of X and Y. Under the causal null hypothesis, (y1, . . ., yN) is a fixed vector of potential outcomes, and so the Y = 1 marginal total in the experiment is fixed at y+ = Σi yi (so is the sample mean of Y, y+/N, which is the proportion of subjects having Y = 1). Under the null, randomization becomes nothing more
than randomly allocating the fixed outcomes (y1, . . ., yN) to the X = 1 and X = 0 categories, which induces a hypergeometric distribution for Σi xiyi, as derived by Fisher for his exact test [Cox and Hinkley, 1974, Ch. 5].
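A minimal computational sketch of this permutation logic (Python; the six subjects and their outcomes are invented toy data) treats the outcome list as fixed under the causal null and re-randomizes the allocation:

```python
import itertools
import numpy as np

x = np.array([1, 1, 1, 0, 0, 0])    # observed allocation: N1 = 3 treated, N0 = 3 not
y = np.array([1, 1, 0, 0, 0, 1])    # outcomes, regarded as fixed under the null

t_obs = int(np.sum(x * y))          # observed trend statistic: sum of x_i * y_i

# Enumerate every allocation that permutes the actual assignment list:
# each choice of 3 subjects for treatment gives one value of the statistic.
stats = [sum(y[i] for i in combo)
         for combo in itertools.combinations(range(len(y)), int(x.sum()))]

# Exact one-sided P-value: the chance, under random allocation of the fixed
# outcomes, of a statistic at least as large as the one observed.
print(t_obs, np.mean([t >= t_obs for t in stats]))   # prints: 2 0.5
```

With binary X and Y, the enumerated statistic follows the hypergeometric distribution just mentioned, so this is Fisher's exact test computed by brute force.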
Causality and Conditionality

The conditionality problem illustrates how the introduction of a causal component into a statistical model can resolve previous ambiguities in choice of a statistical procedure. This resolution comes from explicitly modeling the otherwise hidden within-subject dimension underlying causal questions, and shows how statistical questions can arise even when no ordinary sample-to-population inference problem exists.

There has been a long-running controversy in statistics concerning whether the use of permutation tests is justifiable when the sample distribution of Y (comprising the observed values y1, . . ., yN) is not fixed in advance by the investigator. Some of the arguments for these tests appeal to rather abstruse and somewhat controversial principles of ancillarity or conditionality [Little, 1989], while others are based on favorable repeated-sampling (frequency) properties when y1, . . ., yN represent a sample from a distribution for Y.

If one is concerned only with the observed N units, the question of association of X and Y among those sampled is purely empirical rather than inferential, in that it is answered by simply plotting or cross-tabulating the observed pairs (xi, yi). In this regard, it is no different a question than asking the heights of the N tallest mountains on earth: Accept the reported measurements and you have your answer. The absence of a statistical inference problem in this descriptive question has led many to automatically identify the fixed-margin controversy and even causal inference with the problem of inference to a larger population. Add however the causal dimension and we have a problem of inference from observed properties of the sample (the observed distribution of the pairs (xi, yi)) to unobserved properties of the same sample, namely the unobserved potential outcomes of the sampled subjects. As described above, the fixed Y margin can be deduced directly from the causal null hypothesis for the observed sample of units (i = 1, . . ., N), with no reference to sampling from a larger population. In particular, the fixed Y margin is just a physical property of the sample under the causal null hypothesis among those sampled [Greenland, 1991].

There is however a connection to inference about a population from which the observed units are sampled. First, note that rejection of the null for the sample implies rejection for the population: The sample is part of the population, and thus finding an effect in the sample implies that there is an effect in the population (namely, in the part that composes the sample). The converse condition is that failure to reject the null for the sample should correspond to failure to reject for the population. This converse is not logically necessary, but violating it would amount to asserting that causation exists in a population even though we would not assert causation exists in the portion we observed. At the very least such an
assertion would seem paradoxical [Greenland, 1991].
The General Causal Inference Problem as a Missing-Data Problem

To extend statistical reasoning about causation beyond the null hypothesis, it is essential to add a model for the distributions of the potential outcomes. In the general form of this model, the univariate outcome variable Y is replaced by a fixed, baseline covariate vector Y with components Yx indexed by values x of the treatment X; these Yx are the potential-outcome variables, one for each possible value x of X. (More generally, as described later, X may be a vector X that indexes the potential-outcome vector Y.) The treatment-allocation variable becomes a vector R of indicators with components Rx, where Rx = 1 if and only if component Yx of Y is observed. The observed value ri of R for subject i thus displays which component of Y was observed (if any), and is a vector of zeros except possibly a single 1 at the component corresponding to the actual treatment. As a consequence, r+i = Σx rxi is either 0 or 1 (no or one component of Y observed).

This conceptual framework allows one to view causal-inference problems as special types of missing-data problems, in which only one component of Y can be observed (i.e., no more than one component of R can be 1), leaving the rest of Y as “missing data” [Rubin, 1991]. Because only one component of Y can be observed, the joint distribution Pr(Y = y) of the components of Y is not statistically identified — that is, distinct distributions for Y can lead to exactly the same distribution for the observations. Assuming that every subject has an observed outcome, those observations are the subject-specific dot products R′Y = Σx RxYx, which equal the observed outcomes (the Yx for which Rx = 1), and R, which shows the treatment that was received (i.e., the Rx that equals 1). Without further assumptions or experimental control, all we may identify is Pr(R′Y = r′y, R = r) and functions of it, including the conditional distributions Pr(Yx = y|Rx = 1). In nonexperimental settings, it may be prudent to search for the weakest identification assumptions consonant with our practical goals.
Standardization and Inference on Marginal Effects

The goal of most statistical causal inference is to compare the marginal distributions Pr(Yx = y) of the Yx across X. These comparisons are identified under the independence assumption that for all x, Pr(Yx = y) = Pr(Yx = y|Rx = 1), sometimes called “weak ignorability” (the stronger but inessential condition that R is independent of Y is then called “strong ignorability”), which corresponds to absence of confounding. The paradigmatic example arises when R is randomized by the investigator, for then R is independent of everything, including any potential-outcome vector Y.

In nonrandomized studies, ignorability conditions are usually unacceptable assumptions. One strategy for this situation is to pretend instead that R and Y are
independent conditional on a vector of fully observed covariates Z, called strong ignorability given Z. A weaker condition sufficient for practical applications is the analogous set of X-specific relations,
(1)  Pr(Yx = y|Rx = 1, Z = z) = Pr(Yx = y|Z = z)
for all x, called weak ignorability given Z. Now consider the following “standardization” formula, which biostatisticians, demographers, and epidemiologists may recognize as a modern version of the classical formula for “direct standardization” to the covariate distribution Pr(Z = z):
(2)  Pr(Yx = y) = Σz Pr(Yx = y|Z = z)Pr(Z = z)
This equation is just a basic probability relation displaying Pr(Yx = y) as a covariate-probability weighted average of the covariate-specific potential-outcome probabilities Pr(Yx = y|Z = z). Applying assumption (1) to equation (2) yields
(3)  Pr(Yx = y) = Σz Pr(Yx = y|Rx = 1, Z = z)Pr(Z = z).
This equation provides the desired marginals Pr(Yx = y) in terms of the observable distribution Pr(Yx = y|Rx = 1, Z = z), assuming the standardization is adequate to remove confounding. Equation (3) is thus sometimes termed “no confounding of the marginal effects given Z”. It is a weaker condition than assumption (1) (weak ignorability given Z) because deviations from (1) may average to zero over Z and thus preserve equation (3), or at least leave it an acceptable approximation. In other words, in theory, stratification by Z need not be sufficient for estimating conditional effects in order to be sufficient for estimating marginal effects. Using the fact that Pr(Rx = 1|Z = z) = Pr(Rx = 1, Z = z)/Pr(Z = z), equation (3) may be rewritten in an equivalent form
(4)  Pr(Yx = y) = Σz Pr(Yx = y, Rx = 1, Z = z)/Pr(Rx = 1|Z = z).
This equation displays Pr(Yx = y) as an inverse-probability weighted (IPW) average of the unconditional observation probabilities Pr(Yx = y, Rx = 1, Z = z). The allocation probabilities Pr(Rx = 1|Z = z) that form the inverse weights are sometimes called “propensity scores” and can be generalized to continuous and time-dependent treatment processes [Robins, 1999a; 1999b]. Equation (4) shows how these scores, if known or at least identifiable, can be used to estimate a marginal potential-outcome distribution Pr(Yx = y) under the Z-conditional weak ignorability assumption, which licenses the derivation of equation (3) and hence (4) from equation (2).
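To see formulas (3) and (4) agree numerically, here is a small sketch (Python; the covariate distribution, propensity scores, and outcome probabilities are all invented) for a single treatment value x and binary Z and Y:

```python
pr_z = {0: 0.6, 1: 0.4}      # Pr(Z = z)
pr_r1 = {0: 0.3, 1: 0.7}     # Pr(Rx = 1 | Z = z): the propensity scores
pr_y1 = {0: 0.2, 1: 0.5}     # Pr(Yx = 1 | Rx = 1, Z = z)

# Equation (3): direct standardization to the covariate distribution.
standardized = sum(pr_y1[z] * pr_z[z] for z in (0, 1))

# Equation (4): inverse-probability weighting of the joint probabilities
# Pr(Yx=1, Rx=1, Z=z) = Pr(Yx=1 | Rx=1, Z=z) * Pr(Rx=1 | Z=z) * Pr(Z=z).
ipw = sum(pr_y1[z] * pr_r1[z] * pr_z[z] / pr_r1[z] for z in (0, 1))

print(standardized, ipw)     # both 0.32: the two forms are algebraically identical
```

The cancellation is exact here because the joint probability was built from the same factors; with quantities estimated from data, the two estimators can differ in finite samples.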
Summary

Classical permutation arguments are based on enforcing a distribution for the observation-indicator vector R (e.g., by treatment randomization) and then using
this distribution as the source of subsequent probability statements about the data. These arguments are not heavily emphasized in most statistical training, and yet are the ones most directly linked to potential-outcome models of causation. They should be contrasted with regression statistics, which base their probability statements on assumed distributions for the observed outcome R′Y given observed covariates Z. The advantage of the treatment-based approach to causal inference is clear when indeed the distribution of R is known or at least identifiable, as in experiments: It seems far better to use an identified distribution than one that is merely assumed. Nonetheless, in nonexperimental (observational) research, all distributions become mere assumptions when not based on known mechanisms. The choice between approaches then comes down to judgments about which assumptions are more plausible or at least more palatable. Current research on statistical methods for causal inference includes development of “multiply robust” procedures, which retain their statistical validity under broader conditions than either treatment-based or outcome-based modeling [Kang and Schafer, 2007].

CAUSAL SYSTEMS AND CAUSAL DIAGRAMS

Suppose now we have a time-sequenced set of structural equations, or causal system, in which (for example) the output w of a function f(u, v) may become an input to a later function g(u, w) with output x. (Equivalently, suppose that for a variable W with corresponding potential-outcome vector W, the observed value W = R′W W may be part of the index vectors for the potential outcomes of a subsequent variable.) We may then illustrate the system using a directed acyclic graph (DAG) in which arrows connect input variables to output variables [Pearl, 2009; Glymour and Greenland, 2008; Spirtes et al., 2001]. Such a graph is a causal diagram if (as here) the arrows are interpreted as links in causal chains.

Figure 1. Example of a DAG (arrows: U → W, V → W, U → X, W → X, V → Y, W → Y, X → Y, and W → Z)

Figure 1 provides an example, illustrating a system of structural equations

U = fU(εU),    V = fV(εV),    W = fW(u, v, εW),
X = fX(u, w, εX),    Y = fY(v, w, x, εY),    Z = fZ(w, εZ),
where the inputs εU, εV, εW, εX, εY, εZ are “purely random disturbances,” that is, inputs that are independent random variables (but not necessarily identically distributed). Traditionally, such disturbances are left implicit, i.e., understood to be present but not shown.

The entire system may be viewed as a multivariate model for the graphed variables, with the graph encoding various constraints on the joint distribution of these variables [Lauritzen, 1996; Spirtes et al., 2001; Pearl, 2009]. In particular, the distribution of the disturbances induces a joint distribution of the graphed variables which obeys the Markov decomposition. That is, the joint distribution of the graphed variables decomposes into factors, one for each graphed variable, that give the probability of each variable given its graphical/functional inputs (“parents”). For Figure 1 the decomposition is (in an abbreviated notation)

Pr(u, v, w, x, y, z) = Pr(u)Pr(v)Pr(w|u, v)Pr(x|u, w)Pr(y|v, w, x)Pr(z|w).

Because U and V have no inputs within the system (they are “exogenous”), their factors Pr(u) and Pr(v) are unconditional.

The decomposition provides not only the original joint distribution, but also a formula for the effect on that distribution of shifting the functional relations or distributions of any subset of the variables. For example, randomization of W usually refers to an intervention that replaces W = fW(u, v, εW) by W = fW(ε*W), where the distribution of ε*W (and hence W) is determined by the investigator and hence is known. The resulting joint distribution of the variables is then

Pr(u, v, w, x, y, z) = Pr(u)Pr(v)Pr(w)Pr(x|u, w)Pr(y|v, w, x)Pr(z|w),

the factor Pr(w|u, v) being replaced by the new randomization distribution for W, Pr(w). The corresponding graph lacks arrows into W.

Suppose that instead of randomizing W we intervene to force all values of W to a particular value w0, without altering any other functional relation (if this can be done). We then replace the equation W = fW(u, v, εW) by the equation W = w0. The resulting joint distribution of the variables is then 0 except at W = w0, where it is

Pr(u, v, w0, x, y, z) = Pr(u)Pr(v)Pr(x|u, w0)Pr(y|v, w0, x)Pr(z|w0),

the factor Pr(w|u, v) being replaced by the new, forced distribution Pr(w0) = 1. The corresponding graph lacks arrows into W, and has W = w0 in place of W. Repeating this exercise for other values of W shows how the system responds to various interventions that fix or set W to particular values, again presuming this can be done without disturbing other systemic relations [Pearl, 2009].
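A computational sketch of the decomposition (Python; all variables are reduced to binary and every conditional probability below is invented) makes the point that an intervention merely swaps out the factor for W:

```python
from itertools import product

# Invented binary conditional probability tables for the Figure 1 system.
p_u = {0: 0.5, 1: 0.5}                                                    # Pr(U = u)
p_v = {0: 0.5, 1: 0.5}                                                    # Pr(V = v)
p_w1 = {(u, v): 0.2 + 0.3 * u + 0.3 * v for u in (0, 1) for v in (0, 1)}  # Pr(W=1|u,v)
p_x1 = {(u, w): 0.1 + 0.4 * u + 0.3 * w for u in (0, 1) for w in (0, 1)}  # Pr(X=1|u,w)
p_y1 = {(v, w, x): 0.1 + 0.2 * v + 0.2 * w + 0.4 * x
        for v in (0, 1) for w in (0, 1) for x in (0, 1)}                  # Pr(Y=1|v,w,x)
p_z1 = {0: 0.3, 1: 0.8}                                                   # Pr(Z=1|w)

def bern(p1, value):                      # probability a binary variable takes `value`
    return p1 if value == 1 else 1.0 - p1

def pr(u, v, w, x, y, z, w_factor):
    """Markov decomposition: one factor per variable given its parents,
    with the W factor passed in so interventions can replace it."""
    return (p_u[u] * p_v[v] * w_factor(w, u, v) * bern(p_x1[(u, w)], x)
            * bern(p_y1[(v, w, x)], y) * bern(p_z1[w], z))

def observational(w, u, v):               # Pr(w | u, v), the original factor
    return bern(p_w1[(u, v)], w)

def do_w1(w, u, v):                       # forced distribution: Pr(W = 1) = 1
    return 1.0 if w == 1 else 0.0

for w_factor in (observational, do_w1):
    p_y_equals_1 = sum(pr(u, v, w, x, 1, z, w_factor)
                       for u, v, w, x, z in product((0, 1), repeat=5))
    print(w_factor.__name__, round(p_y_equals_1, 4))
```

Running it shows Pr(Y = 1) shifting between the observational regime and the regime that forces W = 1, with every other factor left untouched.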
Some Useful Elements of Graph Theory

To describe further consequences of the Markov decomposition relevant for causal inference we need several concepts from graphical probability theory. Two variables in a graph are adjacent if they are connected by an arrow. Consider a
“path” in a graph, a nonrepeating sequence of variables in which successive sequence members are adjacent. Effects of one variable on another are transmitted by causal sequences or causal pathways, which in a causal diagram are causal or directed paths in which each arrow points to the tail of the next arrow in the path. More precisely, given a causal diagram, the existence of a directed path from one variable to another is a necessary but not sufficient condition for an effect to occur. Thus, in Figure 1, U → W → Y means U can affect Y via its effect on W, but not that U does affect Y. The graph is acyclic if (as assumed here) no variable in the graph affects itself (meaning there is no feedback loop in the graph).

A central problem of causal inference is distinguishing causation from mere probabilistic dependence, or “association” as it is often called. Graph theory provides a quick visual distinction, via the following concepts. A variable is a collider on a path if it is at a meeting of two arrowheads along the path; otherwise it is a noncollider on the path. In Figure 1, W is a collider on the path UWV but a noncollider on the paths UWY and XWY. A path is said to be closed or blocked at a collider, and open or unblocked at a noncollider. The entire path is open if it has no collider, otherwise it is closed. Open paths include but are not limited to causal pathways, a fact which (as discussed below) reflects classic problems in causal inference.

It is often helpful to think of associations as signals flowing through the graph. Given a graph, associations can flow through or be transmitted by open paths. Open paths themselves are merely conduits for the transmission, however. More precisely, given a graph, the existence of an open path between two variables is a necessary but not sufficient condition for an association between them. (It should be noted however that the presence of an open path will in practice almost always correspond to the presence of an association, although more likely a minuscule one if the path is long [Greenland, 2003].) Conversely, a sufficient (but not necessary) condition for two variables to be unassociated (independent) is that there is no open path between them (that is, any path between them is closed). Thus, we can immediately spot the independencies that must hold in the graphed distribution by just seeing whether two variables have no open path between them. If a statistical test of these independencies rejects them (that is, detects associations where none should be, according to the graph), that result may be taken as evidence against the posited causal system that gave rise to the graph.
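These path rules are mechanical enough to code directly. The sketch below (Python; the edge list is read off the Figure 1 structural equations) enumerates skeleton paths and flags colliders, so an unconditionally open path is simply one with no collider:

```python
EDGES = {("U", "W"), ("V", "W"), ("U", "X"), ("W", "X"),
         ("V", "Y"), ("W", "Y"), ("X", "Y"), ("W", "Z")}   # arrows a -> b from Figure 1

def neighbors(n):
    return {b for a, b in EDGES if a == n} | {a for a, b in EDGES if b == n}

def paths(src, dst, path=None):
    """All nonrepeating paths between src and dst in the graph's skeleton."""
    path = [src] if path is None else path
    if src == dst:
        yield path
        return
    for nxt in sorted(neighbors(src) - set(path)):
        yield from paths(nxt, dst, path + [nxt])

def colliders(path):
    """Variables on the path sitting at a meeting of two arrowheads."""
    return [m for a, m, b in zip(path, path[1:], path[2:])
            if (a, m) in EDGES and (b, m) in EDGES]

for src, dst in [("W", "Y"), ("U", "V")]:
    print(f"Paths between {src} and {dst}:")
    for p in paths(src, dst):
        cs = colliders(p)
        print("  " + "".join(p), "open" if not cs else "closed at " + ",".join(cs))
```

For W and Y it prints the four open paths discussed in the next subsection (WY, WXY, WUXY, WVY); for U and V every path is closed at a collider, matching the independence of U and V in the graph.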
Biases and Confounding

Now suppose we are interested in an effect of one variable (the target antecedent) on another variable (the target outcome). Any open path between these target variables that is not part of this effect is a biasing path, because it provides a pathway for association between the target variables that is not due to the target effect. To illustrate, suppose our interest is in the net (total) effect of the antecedent W
on the outcome Y in Figure 1. This effect equals the net association transmitted through all the causal paths from W to Y, which are WY and WXY. There are, however, two other open paths from W to Y: WUXY and WVY. Thus, the association we observe will be the net transmission through all four paths, which may be far from the net transmission through the two target paths WY and WXY; that is, the signal of interest may be seriously corrupted by unwanted signals through WUXY and WVY. These unwanted signals (transmissions along biasing paths) are examples of biases, although the concept of “bias” subsumes other phenomena as well (such as distortions due to measurement error).

Unconditionally, a biasing path for a net effect in a DAG must pass through a shared (“common”) cause of the target variables. What is more, it must consist of two segments, one being a causal path from the shared cause to the target antecedent, and the other a causal path from the shared cause to the target outcome that does not include the target antecedent. For example, suppose again that our inferential target is the net effect of W on Y in Figure 1, which has biasing paths WUXY and WVY. WUXY can be decomposed into a causal path UW from U to W and a causal path UXY from U to Y, joined at the shared cause U of W and Y. Similarly, WVY can be decomposed into a causal path VW from V to W and a causal path VY from V to Y, joined at the shared cause V of W and Y.

Any bias that is transmitted via a common cause of the target variables is an example of confounding, in that it contributes to the association of the target antecedent variable with the potential target outcomes (that is, it contributes to nonignorability). More generally, confounding arises from association that is transmitted along biasing paths that terminate with an effect on the target outcome. Thus, as described earlier, confounding is association due to “extraneous” effects on the target outcome. Those effects are said to “confound” the target effect. Correspondingly, a confounding path is a biasing path that terminates with an arrow into the outcome of interest [Greenland and Pearl, 2007]. In Figure 1, both of the biasing paths for the effect of W on Y (WUXY and WVY) are confounding paths, because both terminate with an arrow into (effect on) the target outcome Y.

If confounding occurs, variables within the responsible confounding paths are often called confounders. Because the entire confounding path is open, any confounder must be linked by open paths to both target variables, and must have associations with both target variables. The converse is not correct, however: A variable associated with both target variables need not be a confounder. For example, when examining the net effect of U on Y in Figure 1, X could be associated with U and Y but would not be a confounder because it does not lie on a confounding path.

Given a DAG with no conditioning, it can be shown that all biasing paths are confounding paths, and vice versa. Conditioning, however, may open biasing paths, some of which may not be confounding paths. We thus now turn to the concept of conditioning in graphs.
Conditioning and Control

Let G and C be disjoint subsets of variables in the graph, with g and c being sets of values for G and C. Independencies in the conditional distribution Pr(g|c) implied by the graph may then be seen using just a few more concepts. One key notion is that the open/closed status of a variable along a path is reversed by conditioning (stratifying) on the variable: A collider becomes open and a noncollider becomes closed. As a consequence, the status of paths may reverse. For example, if we condition on W in Figure 1, the closed path XUWVY becomes open and now can transmit associations; it thus becomes a confounding path for the X effect on Y. At the same time, the open paths XUWY, XWVY, and XWY become closed and can no longer transmit associations, and thus are no longer confounding paths.

It should be noted that, in accord with ordinary language, experimental sciences use the term “control” to refer to a physical alteration of a system to remove sources of bias, such as randomization. In observational sciences, however, “control of a variable” is often used more broadly to include conditioning on a variable, whether it removes bias or creates bias. Thus, conditioning on W in Figure 1 will “control” (remove) any confounding of the X effect on Y that was present in the original system, but at the same time may introduce new confounding by opening paths that were previously closed.

Not all biasing paths opened by conditioning are confounding paths, however. For example, suppose our target effect is the net effect of U on V in Figure 1. Because there is no causal pathway from U to V, this effect is zero, or “null.” But conditioning on W will open the path UWV, allowing unwanted association to flow from U to V. In other words, conditioning on W transforms UWV into a biasing path which is not a confounding path, because it does not terminate with an arrow into V. Bias that results from such conditioning on a shared effect of the target variables is often called “Berksonian,” in honor of the discoverer of this type of bias, Joseph Berkson [Glymour and Greenland, 2008].

Conditioning can also produce bias due to closing target paths. For example, suppose our target effect is the net effect of U on Y in Figure 1 (associations transmitted via UXY, UWY, and UWXY). This target effect is equal to the unconditional association of U and Y, because there is no unconditional biasing path. Conditioning on W will open the path UWVY, allowing unwanted association to flow from U to Y. In other words, conditioning on W transforms UWVY into a biasing path (which is a confounding path, since it terminates in an arrow into Y). But conditioning on W will also close the target paths UWY and UWXY, blocking part of the effect (signal) of interest. The association of U and Y conditional on W may thus bear little resemblance to the target effect.

Suppose instead that our interest is only in that part of the effect of U on Y not mediated by W (UXY in Figure 1). A standard strategy in the social-science literature is to then condition on W in order to block the effects mediated by W (UWY and UWXY in Figure 1). Unfortunately, this literature almost always overlooks the fact that the same strategy can introduce bias via the pathways
opened by conditioning on the intermediate W (UWVY in Figure 1).

Conditioning on effects of a variable can also partially reverse its status on paths. For example, conditioning on a variable affected by a collider on a path from X to Y can partially open the path and hence can result in new bias if the rest of the path is open after the conditioning. Thus, if our target is the net effect of X on Y in Figure 1, conditioning on Z (affected by W) can partially close the confounding paths passing through W (XUWY, XWY, and XWVY) yet partially open the path XUWVY, which becomes a confounding path.
Collider Bias, Response Bias, and Selection Bias

Any biasing path opened by conditioning (whether full or partial) must pass through at least one collider; hence any bias that results from a newly opened path may be called collider bias [Greenland, 2003]. Of particular interest are those instances in which collider bias results from the process that determines how subjects come to be included in the statistical analysis of a target effect or association. The process is usually described in terms of subject response (to requests for participation) or subject selection (whether selection by the researcher or self-selection by the subject). Any bias that results from the process is thus often called “response bias” or “selection bias.” As mentioned above, “Berksonian bias” usually refers to situations in which both the target variables affect inclusion. A more general and accurate term for all these biases is inclusion bias.

To describe their shared structural and graphical representation, suppose Z is an indicator variable for inclusion in the analysis, with Z = 1 if a subject is included and Z = 0 if not. All associations observed must then be conditional on Z = 1. To say inclusion is random (“random sampling” for analysis) means that the structural equation for Z is Z = εZ, where εZ is a random indicator independent of all other random disturbances; Z will then have no causes in the corresponding graph (it will be exogenous in the graph). But if inclusion is affected by more than one causal pathway, Z will appear in the graph as a collider or as a variable affected by a collider, and observed associations may suffer considerable bias from the forced conditioning on Z = 1.

To illustrate, consider again Figure 1, with Z the inclusion indicator, and again with the target being the net effect of U on Y. Here, W may influence whether a subject is included or not. As a consequence, the forced conditioning on Z = 1 can partially open the path UWVY, which becomes a confounding path, and partially close the target paths UWY and UWXY. The net bias from these changes may be considerable if Z is strongly affected by W, or minor if Z is only weakly affected by W.
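A quick simulation sketch of inclusion bias (Python; effect sizes and the inclusion threshold are invented) shows two marginally independent causes becoming associated once analysis is restricted to included subjects:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

u = rng.normal(size=n)
v = rng.normal(size=n)                  # U and V independent, as in Figure 1
w = u + v + rng.normal(size=n)          # W is a shared effect (collider) of U and V
z = (w + rng.normal(size=n)) > 1.0      # inclusion indicator Z, affected by W

print(np.corrcoef(u, v)[0, 1])          # about 0: no association in the population
print(np.corrcoef(u[z], v[z])[0, 1])    # clearly negative among the included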
DISCUSSION

The above models can be extended to deal with another major source of bias in observational research, measurement error. Because the topic is quite involved and brings in many elements not related to causality, it has not been included here. Notably however, measurement error expands the nonidentification problem to noncausal associations, thus further weakening identification of causal effects from observed associations; see [Greenland, 2005a; 2009; 2010] for examples and discussion.

The above models also extend to consideration of longitudinal (time-varying) treatments such as medical regimens, but again many technical elements arise in applications [Robins, 1999a; 1999b]. For a discussion of relations of the models to the sufficient-component cause model common in epidemiology (which is equivalent to the INUS model of Mackie [1965]) see [Greenland and Brumback, 2002; VanderWeele and Robins, 2008].

As may be apparent from the above presentation, statistical and structural representations of causation bypass most of the philosophic subtleties associated with the complex topic of cause and effect. This bypass has facilitated applications and may reflect the task-oriented attitude of most scientists. Nonetheless, it should not lead one to overlook some serious practical problems that are usually ignored.

Perhaps the largest problem is the possible ambiguity in what it means to intervene on a variable or to “change” its level. This ambiguity can render ambiguous the concept of a potential-outcome vector Y for X. After all, if Yx is a counterfactual component of Y (as all but one component must be), its value may depend in a dramatic fashion on exactly how X would come to be x if x is counterfactual [Greenland, 2005b; Hernán, 2005]. In causal diagrams the same problem is utterly invisible. These are not fatal objections to the models, for the models have proven useful whenever the meaning of interventions and outcomes is unambiguous (for example when X is measles vaccination and Y is the subsequent occurrence of measles). But they are quite disconcerting when the models are used to make claims about the impact of (for example) “eradication of childhood disease”: The effect of such an ambiguous action depends dramatically on exactly how it is carried out (e.g., by vaccinating, by curing, or by killing children).

A related problem of less practical concern, but nonetheless discomforting, is that potential-outcome models and their structural-equation generalizations seem to use an informal notion of causation to define actual effects. In particular, “setting” or forcing a treatment variable X to a particular level x (whether that is done in response to a random-number generator or in response to extraneous factors) is a causal command left undefined in the account. Here again the accompanying causal-diagram theory is silent, taking the causal interpretation of its arrows as a primitive.

Regardless of any objections and problems, the statistics and observational science literature employing potential-outcome models has exploded over the past few decades, while causal diagrams have spread rapidly in the wake. It thus seems
important that those interested in issues of causality become familiar with these formal yet practical tools for causal inference.

ACKNOWLEDGEMENTS

I am grateful to Katherine Hoggatt and a referee for helpful comments on the initial draft of this paper.

BIBLIOGRAPHY

[Berk, 2004] R. A. Berk. Regression analysis: a constructive critique. Newbury Park, CA: Sage, 2004.
[Box, 1980] G. E. P. Box. Sampling and Bayes inference in scientific modeling and robustness. J R Stat Soc Ser A, 143:383–430, 1980.
[Cox and Hinkley, 1974] D. R. Cox and D. V. Hinkley. Theoretical statistics. New York: Chapman and Hall, 1974.
[Dawid, 2000] A. P. Dawid. Causal inference without counterfactuals (with comments). J Am Stat Assoc, 95:407–428, 2000.
[Fisher, 1932] R. A. Fisher. Statistical methods for research workers, 4th ed. London: Oliver and Boyd, 1932.
[Fisher, 1935] R. A. Fisher. The design of experiments. Edinburgh: Oliver and Boyd, 1935.
[Glymour and Greenland, 2008] M. M. Glymour and S. Greenland. Causal diagrams. In K. J. Rothman, S. Greenland, and T. L. Lash, Modern Epidemiology, 3rd edition. Philadelphia, PA: Lippincott Williams & Wilkins, 2008.
[Greenland, 1991] S. Greenland. On the logical justification of conditional tests for two-by-two contingency tables. Am Statist, 45:248–251, 1991.
[Greenland, 2003] S. Greenland. Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology, 14:300–306, 2003.
[Greenland, 2004] S. Greenland. An overview of methods for causal inference from observational studies. In A. Gelman and X. L. Meng, eds. Applied Bayesian modeling and causal inference from an incomplete-data perspective. New York: Wiley, 2004.
[Greenland, 2005a] S. Greenland. Multiple-bias modeling for analysis of observational data (with discussion). J R Stat Soc Ser A, 168:267–308, 2005.
[Greenland, 2005b] S. Greenland. Epidemiologic measures and policy formulation: Lessons from potential outcomes (with discussion). Emerg Themes Epidemiol, 2:1–4, 2005. (Originally published as “Causality theory for policy uses of epidemiologic measures”, Chapter 6.2 in: C. J. L. Murray, J. A. Salomon, C. D. Mathers, and A. D. Lopez, eds. Summary Measures of Population Health. Cambridge, MA: Harvard University Press/WHO, 291–302.)
[Greenland, 2009] S. Greenland. Relaxation penalties and priors for plausible modeling of nonidentified bias sources. Statistical Science, 24:195–210, 2009.
[Greenland, 2010] S. Greenland. Overthrowing the tyranny of null hypotheses hidden in causal diagrams. Ch. 22 in R. Dechter, H. Geffner, and J. Y. Halpern, eds. Heuristics, Probabilities, and Causality: A Tribute to Judea Pearl, pp. 365–382. London: College Publications, 2010.
[Greenland and Brumback, 2002] S. Greenland and B. A. Brumback. An overview of relations among causal modeling methods. Int J Epidemiol, 31:1030–1037, 2002.
[Greenland and Pearl, 2007] S. Greenland and J. Pearl. Causal diagrams. In S. Boslaugh, ed. Encyclopedia of epidemiology. Thousand Oaks, CA: Sage Publications, 2007: 149–156.
[Greenland et al., 1999] S. Greenland, J. M. Robins, and J. Pearl. Confounding and collapsibility in causal inference. Statistical Science, 14:29–46, 1999.
[Hernán, 2005] M. A. Hernán. Hypothetical interventions to define causal effects—afterthought or prerequisite? Am J Epidemiol, 162:618–620, 2005.
[Holland, 1986] P. W. Holland. Statistics and causal inference (with discussion). J Am Stat Assoc, 81:945–970, 1986.
[Hume, 1978] D. Hume. A treatise of human nature. Oxford: Oxford University Press, 1888; 2nd ed., 1978. (Original publication 1739.)
[Hume, 1988] D. Hume. An Enquiry Concerning Human Understanding. Chicago: Open Court Publishing Company, 1988, p. 115. (Original publication 1748.)
[Kang and Schafer, 2007] J. D. Y. Kang and J. L. Schafer. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science, 22:477–580, 2007.
[Kuhn, 1970a] T. S. Kuhn. Reflections on my critics. In I. Lakatos and A. Musgrave, eds. Criticism and the growth of knowledge. Cambridge: Cambridge University Press, 1970.
[Kuhn, 1970b] T. S. Kuhn. The structure of scientific revolutions, 2nd ed. Chicago: University of Chicago Press, 1970, Chapter XIII.
[Lauritzen, 1996] S. Lauritzen. Graphical Models. New York: Oxford, 1996.
[Lewis, 1973a] D. Lewis. Causation. J Philos, 70:556–567, 1973. (Reprinted with postscript in: D. Lewis. Philosophical papers. New York: Oxford University Press, 1986.)
[Lewis, 1973b] D. Lewis. Counterfactuals. Oxford: Blackwell, 1973.
[Little, 1989] R. J. A. Little. On testing equality of two independent binomial proportions. Am Statist, 43:283–288, 1989.
[Mackie, 1965] J. L. Mackie. Causes and conditions. Am Philos Q, 2:245–255, 1965. (Reprinted in E. Sosa and M. Tooley, eds. Causation. New York: Oxford, 1993, 33–55.)
[Mill, 1843] J. S. Mill. A System of Logic, Ratiocinative and Inductive. London: Longmans Green, 1956, Chapter X. (Original publication 1843.)
[Morgan and Winship, 2007] S. L. Morgan and C. Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. New York: Cambridge University Press, 2007.
[Neyman, 1923] J. Neyman. On the application of probability theory to agricultural experiments: Essay on principles, Section 9. 1923. (Partial translation from the original Polish in Statistical Science, 5:465–480, 1990.)
[Pearl, 2009] J. Pearl. Causality, 2nd ed. New York: Cambridge, 2009.
[Robins, 1999a] J. M. Robins. Marginal structural models versus structural nested models as tools for causal inference. In M. E. Halloran and D. Berry, eds. Statistical Models in Epidemiology: The Environment and Clinical Trials, IMA Volume 116, pp. 95–134. New York: Springer-Verlag, 1999.
[Robins, 1999b] J. M. Robins. Association, causation, and marginal structural models. Synthese, 121:151–179, 1999.
[Rubin, 1990] D. B. Rubin. Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science, 5:472–480, 1990.
[Rubin, 1991] D. B. Rubin. Practical implications of modes of statistical inference for causal effects, and the critical role of the assignment mechanism. Biometrics, 47:1213–1234, 1991.
[Spirtes et al., 2001] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search, 2nd ed. Cambridge, MA: MIT Press, 2001.
[VanderWeele and Robins, 2008] T. J. VanderWeele and J. M. Robins. Empirical and counterfactual conditions for sufficient cause interactions. Biometrika, 95:49–61, 2008.
Part X
Some Philosophical Issues Concerning Statistical Learning Theory
STATISTICAL LEARNING THEORY AS A FRAMEWORK FOR THE PHILOSOPHY OF INDUCTION

Gilbert Harman and Sanjeev Kulkarni

Statistical Learning Theory (e.g., [Hastie et al., 2001; Vapnik, 1998; 2000; 2006; Devroye et al., 1996]) is the basic theory behind contemporary machine learning and pattern recognition. We suggest that the theory provides an excellent framework for the philosophy of induction (see also [Harman and Kulkarni, 2007]).

Inductive reasons are often compared with deductive reasons. Deductive reasons for a conclusion guarantee the conclusion in the sense that the truth of the reasons guarantees the truth of the conclusion. Not so for inductive reasons, which typically do not provide the same sort of guarantee. One part of the philosophy of induction is concerned with saying what guarantees there are for various inductive methods.

There are various paradigmatic approaches to specifying the problem of induction. For example, Reichenbach [1949] argued, roughly, that induction works in the long run if anything works in the long run. His proposal has been followed up in interesting ways in the learning-in-the-limit literature (e.g., [Putnam, 1963; Osherson et al., 1982; Kelly, 1996; Schulte, 2002]). The paradigm here is to envision a potentially infinite data stream of labeled items, a question Q about that stream, and a method M that proposes an answer to Q given each finite initial sequence of the data. If the data sequence consists of a series of letters of the alphabet, one question might be whether every “A” in the sequence is followed by a “B”; then the issue is whether there is a method for answering that question after each datum, a method that will eventually give the correct answer from that point on.

A second paradigm assumes one has an initial known subjective probability distribution satisfying certain more or less weak conditions along with a method for updating one’s probabilities, e.g. by conditionalization, and proves theorems about the results of such a method [Savage, 1954; Jeffrey, 2004].

Statistical learning theory, which is our topic, represents a third paradigm, which assumes there is an unknown objective probability distribution that characterizes the data and the new cases about which inferences are to be made, the goal being to do as well as possible in characterizing the new cases in terms of that unknown objective probability distribution. The basic theory attempts to specify what can be proved about various methods for using data to reach conclusions about new cases.
We will be concerned with what we take to be basic statistical learning theory, which is concerned with what can be proved about various inductive methods, given minimal assumptions about the background probability distribution. We are interested in results that hold no matter what that background probability distribution is (as long as the minimal assumptions are satisfied). In other words, we are interested in "worst case" results: even in the worst case, such and such holds.

We begin by sketching certain aspects of basic statistical learning theory, then comment briefly on philosophical implications for reliability theories of justification, Popper's [1979; 2002] appeal to falsifiability, simplicity as relevant to inductive inference, and whether inductive inference from observed cases to a conclusion about a new case presupposes an inference to a generalization covering old and new cases.

PATTERN RECOGNITION

One basic problem discussed in statistical learning theory is the pattern recognition problem: "How can data be used to find good rules for classifying new cases on the basis of the values of certain features of those cases?" As we have indicated, the simplest version of the problem presupposes that there is an unknown statistical probability distribution that specifies probabilistic relations between the feature values of each possible case and its correct classification, and that also specifies how likely various cases are to come up either as data or as new cases to be classified. In this simplest version, the cases are assumed to be independent and identically distributed, but no other assumption about the probability distribution is made.

For example, the Post Office wants to use machines to sort envelopes on the basis of handwritten zip codes. The data are samples of actual handwritten zip codes as they have appeared on envelopes that have been received by the Post Office, samples that have been classified by human operators as representing one or another zip code (or as not representing a zip code). In a standard version of the problem the handwritten cases are digitized and presented as an N by M grid of light intensities, so there are N × M features, the value of each of which is specified as an intensity of light in the corresponding pixel of the grid. A rule for classifying new cases maps each possible value of the N × M features into a particular zip code (or into a decision that the features shown do not represent a zip code).

The feature values of a possible case can be represented as a vector x̄ = (x1, x2, ..., xD) or equivalently as a point in a D-dimensional feature space whose coordinates are specified by x̄. The ith coordinate of a point in the feature space corresponds to the value of the ith feature for an item represented by that point. A pattern recognition rule can be thought of as a rule for assigning labels to each point in the feature space, where the label assigned to a point represents the rule's verdict for a case with those features.
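To fix ideas, here is a minimal sketch in Python of this representation; the toy feature space, the threshold, and the rule itself are invented for illustration and are not part of the Post Office application described above.

```python
from typing import Callable, Tuple

# A feature vector is a point in a D-dimensional feature space;
# a pattern recognition rule maps each such point to a label
# (1 = yes, 0 = no), i.e. it partitions the space into two regions.
FeatureVector = Tuple[float, ...]
Rule = Callable[[FeatureVector], int]

def bright_image_rule(x: FeatureVector) -> int:
    """Toy rule (a stand-in for a real classifier): label an image 1
    when its mean pixel intensity exceeds a fixed threshold."""
    return 1 if sum(x) / len(x) > 128.0 else 0

# A 2x2 "image" flattened into a feature vector of four intensities.
x = (200.0, 190.0, 20.0, 180.0)
print(bright_image_rule(x))  # -> 1
```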
In order to evaluate pattern recognition rules, an assumption is needed about the value of getting the right answer and the cost of getting the wrong answer. In many applications, the value of any right answer is taken to be zero, and the cost of any wrong answer is set at one. In other cases, different values and costs might be assigned. (For example, in some medical diagnostic contexts, a false negative verdict might be assigned a greater cost than a false positive verdict.) Pattern recognition rules are then better or worse depending on the expected value of using them, where expectations are determined by the values and costs of right and wrong answers, the probabilities of the various cases, and the probability that the answer given by the rule is correct for each of those cases. If the value of right answers is set at zero, the best pattern recognition rules have the least expected costs. If the value of right answers is set at zero and the cost of wrong answers is set at 1, then the expected cost corresponds to the probability of error. In this case, the best pattern recognition rules have the smallest probability of error. The Post Office may want to assign different values and costs to correct or incorrect decisions about various cases, but on the other hand may be satisfied (at least at the beginning) to try to find rules with the least probability of error.

The best pattern recognition rules for a given problem are those rules with the least expected cost. Any rule of this sort is called a Bayes rule. There is always at least one such rule. (There is more than one Bayes rule only if for at least one possible case the expected values of two different decisions about the case are tied.)

Without loss of generality, we might assume that we are only concerned with the probability of error of a rule. So, in the rest of our discussion we suppose we are interested in minimizing the probability of error of decisions about new cases. If one is concerned with yes/no issues, so that each item is either an instance of a certain category or not, we can represent a yes classification as 1 and a no classification as 0. In this case, a pattern recognition rule is equivalent to a specification of the set of points in the feature space that the rule classifies as yes or 1. In the Post Office example, one might be concerned with whether a given written figure should be classified as a "9" (yes = 1) or not (no = 0).

BAYES ERROR RATE R*

To review: we suppose each item has D features x1, x2, ..., xD, where each feature xi takes real values. (In the Post Office example, there are N × M features, each of which might take any of as many as 256 intensity values.) We represent the values of the features of a given item with the feature vector x̄ = (x1, x2, ..., xD). We can also represent the values of the features of an item as a point in a D-dimensional feature space. In our simplified case, a pattern recognition rule maps each feature vector x̄ into a yes/no decision. We use 1 for yes and 0 for no. The rule maps points in the feature space into 1 and 0. We can specify the rule by specifying the set of points that it maps into 1.
We suppose there is a statistical probability distribution P specifying the actual statistical probability relations among feature values and the correct classification of items whose features have those values, and also specifying how likely items with those feature values are to come up. A rule with minimal probability of error, i.e. a Bayes rule, maps x̄ into 1 if P(1|x̄) > P(0|x̄), and maps x̄ into 0 if P(1|x̄) < P(0|x̄). (It does not matter what a Bayes rule maps x̄ into when these conditional probabilities are equal.) Applying this rule to a fixed observed x̄, the probability of error is the smaller of P(0|x̄) and P(1|x̄), i.e., min{P(0|x̄), P(1|x̄)}. Therefore, the overall probability of error of a Bayes decision rule, the Bayes error rate, denoted by R*, is

$$R^* = \sum_{\bar{x}} P(\bar{x}) \min\{P(0|\bar{x}), P(1|\bar{x})\}$$
In cases involving probability densities, as specified by a probability density function p, the Bayes error rate is given by

$$R^* = \int \min\{P(0|\bar{x}), P(1|\bar{x})\}\, p(\bar{x})\, d\bar{x}$$
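As a worked illustration of the discrete formula above, the following sketch computes a Bayes rule and the Bayes error rate for a small invented distribution; the three-point feature space and the probabilities are toy assumptions of ours.

```python
# Discrete toy distribution over a feature space of three points.
# P[x] is the probability that a case with features x comes up;
# P1[x] is the conditional probability P(1|x) of label 1 given x.
P  = {"a": 0.5, "b": 0.3, "c": 0.2}
P1 = {"a": 0.9, "b": 0.5, "c": 0.2}

# A Bayes rule labels x as 1 exactly when P(1|x) > P(0|x).
bayes_rule = {x: 1 if P1[x] > 1 - P1[x] else 0 for x in P}

# Bayes error rate: R* = sum over x of P(x) * min{P(0|x), P(1|x)}.
R_star = sum(P[x] * min(P1[x], 1 - P1[x]) for x in P)

print(bayes_rule)        # {'a': 1, 'b': 0, 'c': 0}; b is a tie, so
                         # either verdict there is Bayes-optimal
print(round(R_star, 3))  # 0.5*0.1 + 0.3*0.5 + 0.2*0.2 = 0.24
```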
USING DATA TO LEARN THE STATISTICAL PROBABILITY DISTRIBUTION?

The best pattern recognition rule is determined by the statistical probability distribution P. We assume that this distribution is initially unknown. In that case, we need to use data in order to come up with a good rule. Suppose that various cases arise with known values of various observable features and that an "expert" tells us the correct classification of these cases. We want to be able to use the data, once we have enough of it, to find a rule that will do well on new cases (as assessed by the same "expert") given their feature values, with performance close to that of the Bayes rule.

An idea that doesn't usually work is to use such data to learn the probability distribution P, which is then used to find the Bayes rule. The idea would be to appeal to the probabilistic "law of large numbers," from which we can infer that in the long run the observed frequency with which a feature vector x̄ is classified as 1 will converge to the statistical probability of its being classified as 1. In other words, with probability approaching 1, for any ε however small, there will be a t such that after t occurrences of the feature vector x̄, the frequency with which the label associated with x̄ is 1 will be within ε of the statistical probability of the label's being 1.

If there are finitely many points in the feature space, each representing an event with the features x̄, then we might consider labeled items as they come to our attention and note the frequencies with which each given x̄ is a 1. In each case, the frequency with which a given x̄ is found to be a 1 will converge to its statistical probability of being 1. Because there are only finitely many such events, there will be uniform convergence of these frequencies. In other words, with probability approaching 1, for any ε however small, there will be a t such that after t items of data, the frequency of each x̄ being 1 will be within ε of its probability of being 1.

Unfortunately, this can take a very long time. For example, in the Post Office case, suppose the grid of pixels were only 10 by 100. Then if there were 256 possible intensity values, there would be 256^1000 different feature vectors. Even if there were only 2 possible intensity values, on and off, there would still be 2^1000 > 10^300 different feature vectors. Given that the current age of the universe is less than 10^20 seconds, there would not be time for even a tiny fraction of those feature vectors to show up in anyone's lifetime! We need a method that does not require having to learn the whole probability distribution. Statistical learning theory is the theory of such methods.

EMPIRICAL RISK MINIMIZATION

What is a good way to choose a rule of pattern recognition that will maximize expected value or, in our special case, minimize the probability of error? One initially appealing idea might be simply to choose some rule or other that has the least error on cases that are included in the data. But too many rules have that property. For any given rule, there will be other rules making the same decisions on cases in the data but making all possible decisions on cases not part of the data. Any workable inductive procedure must therefore have some sort of inductive bias. It must favor certain rules of pattern recognition over others that make exactly the same decisions for cases in the data.

The crudest inductive bias simply restricts the relevant pattern recognition rules to a selected class of rules C. Then a policy of empirical risk minimization, given some data, selects a rule from C with minimal cost on the data.

But of course not all choices of the class of rules C are equally good. As we have just observed, if C contains all possible rules, a policy of enumerative induction or empirical risk minimization fails to give any advice at all about new cases (or amounts to choosing classifications of new cases at random). On the other hand, if C does not contain all rules, it may fail to contain a Bayes rule—a best rule of pattern classification in this instance. Indeed, it may fail to contain anything whose performance is close to a Bayes rule. It might even contain only the worst possible rules! But let us put this last problem to the side for a moment. We will come back to it shortly.

Consider this question: what has to be true of the set C of rules so that, no matter what the unknown background probability distribution, empirical risk minimization eventually does as well as possible with respect to the rules in C? More precisely, what has to be true of the set C in order to guarantee that, with
probability approaching 1, no matter what the unknown probability distribution, given more and more data, the probability of error for the rules that empirical risk minimization endorses at each stage eventually approaches the minimum value of probability of error of rules in C? A fundamental result of statistical learning theory is that the set of rules in C cannot be too rich, where the richness of C is measured by its VC-dimension. Let us explain.
Shattering and VC-dimension

Recall that we are restricting attention to rules that map a set of D feature values into a yes/no verdict for a particular categorization, where we use 1 for yes and 0 for no. As we have mentioned, we can represent such a rule in terms of a D-dimensional feature space in which each point is labeled 1 or 0. Each point in that space represents a possible set of features. The ith coordinate of a point in the feature space corresponds to the value of the ith feature for that point. The label attached to the point represents that rule's verdict for a case with those features.

Now consider a set of N points in a given feature space and consider the 2^N possible ways to label those points as yes or no. If for each possible way of labeling those points there is a rule in C that agrees with that labeling, we say that the rules in C shatter those N points. To say that the rules in C shatter a particular set of points is to say that no possible assignment of verdicts to those points (those possible cases) "falsifies" the claim that some rule in C correctly represents those verdicts, to use terminology from Popper [1979; 2002].

The VC-dimension of a set of rules C is the largest finite number N for which some set of N points is shattered by rules in C. If there is no such largest finite number, the VC-dimension of C is infinite.
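The definition can be checked mechanically for simple classes. Below is a brute-force sketch (ours, not from the text) for the class of threshold rules on one real feature; such rules shatter any single point but no pair of points, so their VC-dimension is 1.

```python
def threshold_behaviours(points):
    """All behaviours on `points` of rules f_t(x) = 1 iff x >= t,
    obtained by letting the threshold t sweep across the points
    (plus one value above the largest point)."""
    candidates = sorted(points) + [max(points) + 1.0]
    return {tuple(1 if x >= t else 0 for x in points) for t in candidates}

def shatters(points):
    """The class shatters `points` iff every one of the 2^N
    possible labelings is realised by some threshold rule."""
    return len(threshold_behaviours(points)) == 2 ** len(points)

print(shatters([0.5]))       # True: any single point is shattered
print(shatters([0.2, 0.8]))  # False: the labeling (1, 0) is unrealisable,
                             # so the VC-dimension of threshold rules is 1
```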
Fundamental Result

Recall that empirical risk minimization says to select a rule in C with least error on the data. Vapnik and Chervonenkis [1968] show that empirical risk minimization works, no matter what the unknown statistical probability distribution is, if, and only if, C has finite VC-dimension. More precisely (subject to mild measurability conditions): if and only if C has finite VC-dimension, then with probability approaching 1, no matter what the unknown probability distribution, given more and more data, the probability of error for the rules that empirical risk minimization endorses at each stage eventually approaches the minimum value of the probability of error of rules in C.

The Vapnik-Chervonenkis result also provides information about how much data are needed for empirical risk minimization to produce a good result no matter what the unknown statistical probability distribution is. If the rules in C have VC-dimension V, there will be a function m(V, ε, δ) indicating the maximum amount of data needed (no matter what the unknown statistical probability distribution) to ensure that the probability is less than δ that enumerative induction will endorse a hypothesis with a probability of error that exceeds the minimum probability of error for rules in C by more than ε. (A smaller ε indicates a better approximation to the minimum probability of error for rules in C, and a smaller δ indicates a higher probability that the rules endorsed will be within the desired approximation to that minimum probability of error.) Where there is such a function m(V, ε, δ) there is what has come to be called "Probably Approximately Correct" (or "PAC") learning (terminology due to [Valiant, 1984]).
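A function of the kind m(V, ε, δ) can be made concrete by numerically inverting a confidence bound of the standard VC form; the particular term used below is the one quoted in Steel's companion essay later in this volume, while the doubling search and the parameter values are our illustrative assumptions (the n returned is sufficient, not minimal).

```python
import math

def vc_confidence(h, n, eta):
    """Width of a standard VC confidence term: with probability 1 - eta,
    the true error exceeds the training error by at most this amount."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def sample_size(h, eps, delta):
    """Smallest n found by doubling with vc_confidence(h, n, delta) <= eps:
    a crude stand-in for the function m(V, eps, delta) in the text."""
    n = h
    while vc_confidence(h, n, delta) > eps:
        n *= 2
    return n

# Perceptrons on D = 2 features have VC-dimension D + 1 = 3 (see below).
print(sample_size(h=3, eps=0.1, delta=0.05))  # -> 3072
```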
Example: Perceptron Learning

A perceptron is a simple classifier that takes the weighted sum of the D input feature values (along with an additional constant input value) and outputs 1 for yes if the result of the weighted sum is greater than some threshold T and outputs 0 for no otherwise. Given data, it is easy to find a threshold and weights for such a perceptron that yield the least error (or cost) on that data.

Any classification rule that can be represented by such a perceptron divides the D-dimensional feature space into two regions, the yes region and the no region, where the regions are separated by a line, plane, or hyperplane. In other words, a perceptron classifier linearly separates the points in the feature space. The class C in this case is the class of linear separation rules in that feature space. The VC-dimension of such linear separations of a D-dimensional feature space is D + 1, which is finite, so the result mentioned in the previous subsection applies. We can know how many items of data are needed in order probably to approximate the best such separation.

Here we can return to the worry we temporarily put aside, because it applies to perceptron learning. The worry is that the class C of rules used for empirical risk minimization may fail to include the best rule, the Bayes rule, and may indeed contain no rule whose performance is even close to that of the Bayes rule. Obviously, many possible classification rules are not and cannot be approximated by linear separation rules. The XOR rule is a well known example. Suppose there are two features, x1 and x2, each of which takes a real value between −1 and +1, and the correct classification rule, which is also the Bayes rule, takes the value 1 if and only if x1 × x2 < 0. Here there is a two-dimensional feature space in which the points in the upper left quadrant and the lower right quadrant are to be labeled 1 and the points in the upper right quadrant and lower left quadrant are to be labeled 0. Clearly there is no way to draw a line separating the points to be labeled 1 from those to be labeled 0. Indeed, if the statistical probability density is evenly distributed over the feature space, any linear separation has a significant probability of error while the Bayes rule has a probability of error of 0!
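The XOR failure can be exhibited directly. The sketch below (an illustration we add) trains a perceptron by the usual error-correction updates on four XOR-style points; since no linear separation classifies all four correctly, at least one training point is always misclassified, however long training runs.

```python
import random

def perceptron_output(w, b, x):
    """1 iff the weighted sum of the features, plus the constant input b,
    exceeds zero (a threshold of T = 0 without loss of generality)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(data, epochs=200, seed=0):
    """Plain perceptron error-correction updates; this finds a separating
    rule whenever one exists, but XOR admits none."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            err = y - perceptron_output(w, b, x)
            w = [wi + err * xi for wi, xi in zip(w, x)]
            b += err
    return w, b

# XOR-style data: label 1 exactly when x1 * x2 < 0.
xor_data = [((-1, -1), 0), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 0)]
w, b = train_perceptron(xor_data)
errors = sum(perceptron_output(w, b, x) != y for x, y in xor_data)
print(errors)  # >= 1: no linear separation classifies all four points
```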
Example: Feed-forward Neural Network Learning

This last worry can be somewhat alleviated by using a feed-forward neural network with several layers of perceptrons. Inputs go to perceptrons in the first layer, whose outputs are inputs to perceptrons in the second layer, and so on to a final perceptron that outputs 0 or 1 depending on whether the weighted sum of its inputs is above or below a certain threshold. A fixed network with fixed connection strengths between nodes and fixed thresholds can be used to classify inputs as 0 or 1, so such a network represents a classification rule. As before, varying connection strengths and thresholds (while retaining all connections) yields other rules.

There are more or less good methods for finding a rule represented by a given structure of perceptrons that has least error on given data. These learning methods typically involve smoothing the threshold functions of all but the final perceptron and use a kind of "gradient descent" that may or may not become stuck in a "local minimum" that is not a global minimum. In any case there will be one or more settings of connection strengths and thresholds that minimize error on the data.

It can be shown that the set of all rules represented by possible ways of varying connection strengths and thresholds for any particular feed-forward network has a finite VC-dimension. So, the PAC learning criterion is satisfied. Furthermore, any (nonpathological) classification rule can be approximated arbitrarily closely by feed-forward neural networks with enough layers and nodes. So it is possible to ensure that the error rate of the best rule that can be represented by such a network is not far from the error rate of the best possible rule, the Bayes rule.

Of course, adding nodes increases the VC-dimension of such a network, which means more data will be needed to guarantee a certain level of performance under the PAC criterion. So, there is a trade-off between the amount of data needed to satisfy the PAC criterion and how close the error rate of the best rule in C is to the Bayes error rate. Furthermore, no matter how large the VC-dimension of C, as long as it is finite, there is no guarantee that, no matter what the background statistical probability distribution, the error rate of the rule selected via empirical risk minimization will converge to the Bayes error.

DATA COVERAGE BALANCED AGAINST SOMETHING ELSE

Instead of using pure empirical risk minimization, an alternative learning procedure balances empirical error or cost on the data against something else—often called simplicity, although (as we will see) that is not always a good name for the relevant consideration—and then allows C to have infinite VC-dimension. There are versions of this strategy which, with probability approaching 1, will eventually come up with rules from C whose probability of error approaches that of the Bayes rule in the limit. However, because the rules in C have infinite VC-dimension, the PAC result does not hold. So "eventually" might be a very long time.

One version of this alternative strategy is concerned with rules that can be specified in a particular notation. Each rule in C is assigned a number, its length as measured by the number of symbols used in its shortest description in that notation. Given data, the procedure is to select a rule for which (for example) the sum of its empirical error (or cost) on the data plus its length is minimal.

Another version identifies C with the union of a nested series of classes, each of finite VC-dimension, C1 ⊂ C2 ⊂ · · · ⊂ Cn ⊂ · · ·, where the VC-dimension of Ci is less than the VC-dimension of Ci+1. Given data, the procedure is then to select a rule that minimizes the sum of its empirical error (or cost) on the data plus the number i of the smallest class Ci to which the rule belongs. This version is called structural risk minimization.

These two kinds of ordering can be quite different. If rules are ordered by description length, linear separations will be scattered throughout the ordering, because the length of a description of a linear rule will depend on the number of symbols needed to specify various constants in those rules. In the ordering of classes of rules by VC-dimension, linear separations can be put ahead of quadratic separations, for example, because linear separations in a given space have a lower VC-dimension than quadratic separations in the same space. We return later to the question of whether any approach of this sort is best described as balancing simplicity against data-coverage.
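Here is a schematic sketch of the structural risk minimization selection step just described. The nested classes, the two toy rules, and the data are invented stand-ins of ours; real applications would use parametrized families rather than explicit rule lists.

```python
def empirical_error(rule, data):
    """Number of mistakes the rule makes on the labeled data."""
    return sum(rule(x) != y for x, y in data)

def structural_risk_minimization(nested_classes, data):
    """Pick the rule minimizing empirical error plus the index i of the
    smallest class C_i containing it, the penalty scheme in the text.
    `nested_classes` lists C_1 subset C_2 subset ... as explicit lists."""
    best, best_score, seen = None, float("inf"), set()
    for i, C_i in enumerate(nested_classes, start=1):
        for rule in C_i:
            if id(rule) in seen:   # already counted in a smaller class
                continue
            seen.add(id(rule))
            score = empirical_error(rule, data) + i
            if score < best_score:
                best, best_score = rule, score
    return best, best_score

# Toy nested classes on one real feature: C1 holds a fixed threshold rule,
# C2 adds an interval rule (a richer class, so a higher penalty).
threshold = lambda x: 1 if x >= 0.0 else 0
interval = lambda x: 1 if 0.2 <= x <= 0.6 else 0
C1, C2 = [threshold], [threshold, interval]

data = [(-0.5, 0), (0.3, 1), (0.5, 1), (0.9, 0), (1.2, 0)]
rule, score = structural_risk_minimization([C1, C2], data)
print(rule is interval, score)  # True 2: zero errors plus penalty 2 beats
                                # the threshold rule's two errors plus 1
```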
Example: Support Vector Machines

The final perceptron in a feed-forward neural network makes a linear decision in the space represented by its inputs. The earlier parts of the feed-forward network can therefore be thought of as mapping the feature space into the space for which the final perceptron makes a linear decision. This suggests a more general strategy for solving pattern recognition problems. Map the feature space into another space—perhaps a space with many more dimensions—and then make a linear decision in that space that minimizes error on the data.

It is true that the more dimensions the other space has, the higher the VC-dimension of the rules represented by linear separations in that space. However, support vector machines get around this problem by using wide margin separations instead of just any linear separation. Linear separations are hyperplanes that have no thickness. Wide margin separations are thick hyperplanes, hyperslabs. Vapnik [2006] observes that, if the relevant points in a space are confined to a hypersphere of a given size, then the VC-dimension of wide margin separations of those points is often much less than the VC-dimension of pure linear separations. Even if the space in question has infinitely many dimensions, the VC-dimension of wide-margin separations of points in that hypersphere is finite and inversely related to the size of the margin, the thickness of the hyperslabs.
Support vector machines first map the feature space to a very large or infinite dimensional space in which images of points in the feature space are confined to a hypersphere of fixed radius. Then a wide margin separation of points in the transformed space is selected by trading off empirical error on the data against the width of margins of the best wide margin separations.
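The mapping idea, short of a full support vector machine, can be made concrete with the XOR example from earlier: adding a product coordinate makes the pattern linearly separable in three dimensions. The particular feature map and weight vector below are our illustrative choices.

```python
def phi(x):
    """Map the 2-D feature space into 3-D by adding the product
    coordinate; the XOR pattern becomes linearly separable there."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_rule(z, w=(0.0, 0.0, -1.0), b=0.0):
    """A linear separation in the transformed space: 1 iff w.z + b > 0.
    With w = (0, 0, -1) it fires exactly when x1 * x2 < 0."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0

xor_data = [((-1, -1), 0), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 0)]
print(all(linear_rule(phi(x)) == y for x, y in xor_data))  # True
```

A support vector machine would additionally select, among the linear separations available in the transformed space, one whose margin is as wide as the data allow.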
Transduction

The learning methods discussed so far use labeled data to find a rule that is then used to classify new cases as they arise. Furthermore, these methods all involve learning total classifications. Nearest neighbor methods, perceptrons, multilayer feed-forward networks, and standard support vector machines all yield rules that assign a classification to every possible set of features.

We could modify some of these methods to provide only partial classifications. For example, we could modify support vector machines so as not to choose among the various separating hyperplanes internal to the selected separating hyperslab. The points in this in-between region would be left unclassified. The system would still be an inductive method, since it would classify some, perhaps many, new cases in accordance with a rule derived from labeled data, but the rule would not be a total rule, since it would not characterize all points in the space.

Suppose we are using a method that in this way provides only a partial classification of cases, and a case arises to be classified in the intervening space of previously unclassified cases. Vapnik [1998; 2000; 2006] considers certain transductive methods for classifying such new cases, methods that use information about what new cases have come up to be classified and then select a subset of separations that (a) correctly classify the data and (b) agree on their classifications of the new cases. In one version, the selected separations also (c) disagree as much as possible on the classifications of other possible cases.

An important related version of transduction uses not only the information that certain new cases have come up to be classified but also the information that there is a certain set U (the "universum") of examples that are hard to classify. In this version, transduction selects the subset of linear separations satisfying (a) and (b) but disagreeing as much as possible on the classification of the hard cases in U.

Transduction performs considerably better than other methods in certain difficult real-life situations involving high-dimensional feature spaces where there is relatively little data.
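The following is a schematic, brute-force rendering of conditions (a) and (b) for the simplest possible class, threshold rules on a line; the candidate thresholds and the data are invented. Rules inconsistent with the labeled data are discarded, and a new case receives a label only where all surviving separations agree, leaving disputed cases unclassified.

```python
def threshold_rule(t):
    """One-dimensional separation: label 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0

def transductive_labels(labeled, new_cases, thresholds):
    """Keep rules that correctly classify the labeled data (condition (a));
    a new case is labeled only where every surviving rule agrees, so
    condition (b) holds for the cases that do get labeled."""
    consistent = [threshold_rule(t) for t in thresholds
                  if all(threshold_rule(t)(x) == y for x, y in labeled)]
    out = {}
    for x in new_cases:
        votes = {r(x) for r in consistent}
        out[x] = votes.pop() if len(votes) == 1 else None
    return out

labeled = [(0.1, 0), (0.9, 1)]            # these labels pin t into (0.1, 0.9]
thresholds = [i / 10 for i in range(11)]  # candidate separations
print(transductive_labels(labeled, [0.05, 0.5, 0.95], thresholds))
# {0.05: 0, 0.5: None, 0.95: 1} -- the middle case stays unclassified
```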
PHILOSOPHICAL IMPLICATIONS

We conclude by briefly mentioning a few ways in which statistical learning theory has philosophical implications. It is relevant to reliability theories of epistemic justification, it helps to situate some of Popper's often misunderstood appeals to falsifiability, it casts doubt on claims about the importance of simplicity considerations in inductive reasoning, and it illuminates discussion of direct inductive inference.
Reliability

Many philosophical epistemologists (e.g., [Goldman, 1986; 2002; Bishop and Trout, 2005]) argue that epistemology should be concerned with the reliability of methods of belief formation (or more generally of belief revision). We believe that the relevant notion of reliability makes sense only in relation to an appeal to something like the sort of background statistical probability distribution that figures in the pattern recognition problem studied in statistical learning theory. For example, the statistical likelihood that a method will yield true answers might provide a measure of the reliability of the method. If so, statistical learning theory provides one sort of foundation for philosophical epistemology.
VC Dimension and Popperian Falsifiability

There is an interesting relation between the role of VC-dimension in the PAC result and the emphasis on falsifiability in Karl Popper's writings in the philosophy of science. Popper [1934] famously argues that the difference between scientific hypotheses and metaphysical hypotheses is that scientific hypotheses are "falsifiable" in a way that metaphysical hypotheses are not. To say that a certain hypothesis is falsifiable is to say that there is possible evidence that would not count as consistent with the hypothesis. According to Popper, evidence cannot establish a scientific hypothesis; it can only "falsify" it. A scientific hypothesis is therefore a falsifiable conjecture. A useful scientific hypothesis is a falsifiable hypothesis that has withstood empirical testing.

Recall that enumerative induction requires a choice of a set of rules C. That choice involves a "conjecture" that the relevant rules are the rules in C. If this conjecture is to count as scientific rather than metaphysical, according to Popper, the class of rules C must be appropriately "falsifiable."

Many discussions of Popper treat his notion of falsifiability as an all or nothing matter, not a matter of degree. But in fact Popper does allow for degrees of difficulty of falsifiability [2002, sections 31-40]. For example, he asserts that a linear hypothesis is more falsifiable — easier to falsify — than a quadratic hypothesis. This fits with VC theory, because the collection of linear classification rules has a lower VC-dimension than the collection of quadratic classification rules.

However, Popper's measure of degree of difficulty of falsifiability of a class of hypotheses does not quite correspond to VC-dimension [Corfield et al., 2005]. Where the VC-dimension of a class C of hypotheses is the largest number N such that some set of N points is shattered by rules in C, what we might call the "Popper dimension" of the difficulty of falsifiability of a class is the largest number N such that every set of N points is shattered by rules in C. This difference between some and every is important, and VC-dimension turns out to be the key notion rather than Popper-dimension.

Popper also assumes that the falsifiability of a class of hypotheses is a function of the number of parameters used to pick out instances of the class. This turns out not to be correct either for Popper dimension or VC-dimension, as discussed below. This suggests that Popper's appeal to degree of falsifiability would be improved by adopting VC-dimension as the relevant measure in place of his own measure.
Simplicity

We now want to say something more about Popper's [1972; 2002] discussion of scientific method. Popper argues that there is no justification for any sort of inductive reasoning, but he does think there are justified scientific methods. In particular, he argues that a version of structural risk minimization best captures actual scientific method (although of course he does not use the term "structural risk minimization"). In his view, scientists accept a certain ordering of classes of hypotheses, an ordering based on the number of parameters needing to be specified to be able to pick out a particular member of the class. So, for example, for real value estimation on the basis of one feature, linear hypotheses of the form y = ax + b have two parameters, a and b; quadratic hypotheses of the form y = ax² + bx + c have three parameters, a, b, and c; and so forth. So, linear hypotheses are ordered before quadratic hypotheses, and so forth. Popper takes this ordering to be based on "falsifiability" in the sense that at least three data points are needed to "falsify" a claim that the relevant function is linear, at least four are needed to "falsify" the claim that the relevant function is quadratic, and so forth.

In Popper's somewhat misleading terminology, data "falsify" a hypothesis by being inconsistent with it, so that the hypothesis has positive empirical error on the data. He recognizes, however, that actual data do not show that a hypothesis is false, because the data themselves might be noisy and so not strictly speaking correct.

Popper takes the ordering of classes of hypotheses in terms of parameters to be an ordering in terms of "simplicity" in one important sense of that term. So, he takes it that scientists balance data-coverage against simplicity, where simplicity is measured by "falsifiability" [Popper, 2002, section 43]. We can distinguish several claims here.

(1) Hypothesis choice requires an ordering of nested classes of hypotheses.

(2) This ordering represents the degree of "falsifiability" of a given class of hypotheses.

(3) Classes are ordered in accordance with the number of parameters whose values need to be specified in order to pick out specific hypotheses.
(4) The ordering ranks simpler hypotheses before more complex hypotheses.

Claim (1) is also part of structural risk minimization. Claim (2) is similar to the appeal to VC-dimension in structural risk minimization, except that Popper's degree of falsifiability does not coincide with VC-dimension, as noted above. As we will see in a moment, claim (3) is inadequate and, interpreted as Popper interprets it, it is incompatible with (2) and with structural risk minimization. Claim (4) is at best terminological and may just be wrong.

Claim (3) is inadequate because there can be many ways to specify the same class of hypotheses, using different numbers of parameters. For example, linear hypotheses in the plane might be represented as instances of abx + cd, with four parameters instead of two. Alternatively, notice that it is possible to code a pair of real numbers a, b as a single real number c, so that a and b can be recovered from c. That is, there are functions such that f(a, b) = c, where f1(c) = a and f2(c) = b.¹ Given such a coding, we can represent linear hypotheses as f1(c)x + f2(c), using only the one parameter c. In fact, for any class of hypotheses that can be represented using P parameters, there is another way to represent the same class of hypotheses using only 1 parameter.

¹ For example, f might take the decimal representations of a and b and interleave them to get c.

Perhaps Popper means claim (3) to apply to some ordinary or preferred way of representing classes in terms of parameters, so that the representations using the above coding functions do not count. But even if we use ordinary representations, claim (3) conflicts with claim (2) and with structural risk minimization. To see this, consider the class of sine curves y = a + sin(bx) that might be used to separate points in a one-dimensional feature space, represented by the points on a line between 0 and 1. Almost any n distinct points in this line segment are shattered by curves from that class. So this class of sine curves has infinite "falsifiability" in Popper's sense (and infinite VC-dimension) even though only two parameters have to be specified to determine a particular member of the set, using the sort of representation Popper envisioned. Popper himself did not realize this and explicitly treats the class of sine curves as relatively simple in the relevant respect [1934, Section 44].

The fact that this class of sine curves has infinite VC-dimension (as well as infinite falsifiability in Popper's sense) is some evidence that the relevant ordering of hypotheses for scientific hypothesis acceptance is not a simplicity ordering, at least if sine curves count as "simple".
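The coding trick in the footnote is easy to make concrete for finite decimal expansions (a sketch we add; coding arbitrary real numbers takes more care, but the idea is the same).

```python
def interleave(a: str, b: str) -> str:
    """Code two equal-length digit strings into one by interleaving."""
    return "".join(x + y for x, y in zip(a, b))

def split(c: str) -> tuple:
    """Recover both digit strings from the code."""
    return c[0::2], c[1::2]

a, b = "3141", "2718"        # finite decimal expansions of a and b
c = interleave(a, b)         # one "parameter" coding the pair
print(c)                     # 32174118
print(split(c) == (a, b))    # True: a and b are recoverable from c
```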
Transduction as Direct Induction

Vapnik [2000, p. 293] says that transduction does not involve first inferring an inductive generalization which is then used for classification. Harman [1965; 1967] argues that any such inference should always be treated as a special case of inference to the best explanation, where the relevant sort of explanation appeals to a generalization. But the apparent conflict here appears to be merely terminological.
Transduction differs from the other inductive methods we have been discussing in this way: the classification of new cases is not always based on an inductive generalization from labeled data alone. That is because transduction also makes use of the information that certain new cases have come up to be assessed. On the other hand, transduction does involve the implicit acceptance of a generalization G, corresponding to the selected subset of separations in the transformed higher dimensional space. So, transduction does involve inductive generalization, even if not inductive generalization from the labeled data alone, since it makes use of the extra information that certain new cases have come up to be assessed.

It is true that, although the data include what new cases have come up, the classifications that transduction gives to these new cases are not treated as data. When additional new cases arise, transduction applied to the old plus the new cases can modify the classifications. It might therefore be said that the principle G derived from accepting the new classifications is hostage to the new cases in a way that inductive generalizations from labeled data are not. But transduction treats the fact that certain new cases have come up as data, and new data always have the potentiality to change what rule should be accepted.

In other words, there is a sense in which transduction does not involve inductive generalization, because the relevant generalization is not arrived at from the labeled data alone, and there is a sense in which transduction does involve inductive generalization, because it does arrive at a general rule based on labeled data plus information about what new cases have come up. What is important and not merely terminological is that, under certain conditions, transduction gives considerably better results in practice than those obtained from methods that use labeled data to infer a rule which is then used to classify new cases [Joachims, 1999; Vapnik, 2000; Weston et al., 2003; Goutte et al., 2004].
CONCLUSION

To summarize, we began by sketching certain aspects of statistical learning theory and then commented on philosophical implications for reliability theories of justification, on Popper's appeal to falsifiability, on simplicity as relevant to inductive inference, and on whether direct induction is an instance of inference to the best explanation.
BIBLIOGRAPHY

[Bishop and Trout, 2005] M. A. Bishop and J. D. Trout. Epistemology and the Psychology of Human Judgment. Oxford: Oxford University Press, 2005.
[Corfield et al., 2005] D. Corfield, B. Schölkopf, and V. Vapnik. Popper, Falsification and the VC-dimension. Max Planck Institute for Biological Cybernetics Technical Report No. 145, 2005.
[Devroye et al., 1996] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[Goldman, 1986] A. Goldman. Epistemology and Cognition. Cambridge, MA: Harvard University Press, 1986.
[Goldman, 2002] A. Goldman. Pathways to Knowledge: Private and Public. Oxford: Oxford University Press, 2002.
[Goutte et al., 2004] C. Goutte, N. Cancedda, E. Gaussier, and H. Déjean. Generative vs Discriminative Approaches to Entity Extraction from Label Deficient Data. JADT 2004, 7es Journées internationales d'Analyse statistique des Données Textuelles, Louvain-la-Neuve, Belgium, 10-12 March, 2004.
[Harman and Kulkarni, 2007] G. Harman and S. Kulkarni. Reliable Reasoning: Induction and Statistical Learning Theory. Cambridge, MA: MIT Press, 2007.
[Hastie et al., 2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.
[Jeffrey, 2004] R. Jeffrey. Subjective Probability (The Real Thing). Cambridge, England: Cambridge University Press, 2004.
[Joachims, 1999] T. Joachims. Transductive Inference for Text Classification Using Support Vector Machines. In I. Bratko and S. Dzeroski, editors, Proceedings of the 16th International Conference on Machine Learning: 200-9. San Francisco: Morgan Kaufmann, 1999.
[Kelly, 1996] K. T. Kelly. The Logic of Reliable Inquiry. Oxford: Oxford University Press, 1996.
[Osherson et al., 1982] D. Osherson, M. Stob, and S. Weinstein. Systems That Learn. Cambridge: MIT Press, 1982.
[Popper, 1979] K. Popper. Objective Knowledge: An Evolutionary Approach. Oxford: Clarendon Press, 1979.
[Popper, 2002] K. Popper. The Logic of Scientific Discovery. London: Routledge, 2002.
[Putnam, 1963] H. Putnam. Degree of Confirmation and Inductive Logic. In The Philosophy of Rudolf Carnap, ed. P. A. Schilpp. LaSalle, Indiana: Open Court, 1963.
[Reichenbach, 1949] H. Reichenbach. The Theory of Probability. Berkeley: University of California Press, 1949.
[Savage, 1954] L. J. Savage. The Foundations of Statistics. New York: Wiley, 1954.
[Schulte, 2002] O. Schulte. Formal Learning Theory. Stanford Encyclopedia of Philosophy, Edward N. Zalta, editor, 2002. http://plato.stanford.edu/.
[Valiant, 1984] L. G. Valiant. A Theory of the Learnable. Communications of the ACM 27, pp. 1134-1142, 1984.
[Vapnik and Chervonenkis, 1968] V. Vapnik and A. Ja. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities (in Russian). Doklady Akademii Nauk USSR 181, 1968. Translated into English as "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and Its Applications 16 (1971), pp. 264-280.
[Vapnik, 1998] V. Vapnik. Statistical Learning Theory. New York: Wiley, 1998.
[Vapnik, 2000] V. Vapnik. The Nature of Statistical Learning Theory, second edition. New York: Springer, 2000.
[Vapnik, 2006] V. Vapnik. Estimation of Dependencies Based on Empirical Data, 2nd edition. New York: Springer, 2006.
[Weston et al., 2003] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf. KDD Cup 2001 Data Analysis: Prediction of Molecular Bioactivity for Drug Design-Binding to Thrombin, 2003.
TESTABILITY AND STATISTICAL LEARNING THEORY

Daniel Steel

Emerging in the 1960s from attempts to model learning in neural networks and machines, statistical learning theory subsequently expanded into a more general theory of learning from statistical data [Vapnik, 2000]. Statistical learning theory is of philosophical interest in a number of ways, but especially because one of its central concepts, Vapnik-Chervonenkis (VC) dimension, bears more than a passing resemblance to Karl Popper's notion of testable or falsifiable theories [Corfield et al., 2005; Harman and Kulkarni, 2007, 50-52]. In this essay, I explore this connection with an emphasis on the underlying motivations of the two concepts.

The concept of VC dimension is involved in two central results in statistical learning theory: the first identifies finite VC dimension as a necessary condition for long run convergence, while the second shows how a preference for lower VC dimension can improve "the rate of convergence" [Vapnik, 2000, 21, 83]. When viewed from this broad perspective, there are striking similarities in motivation between VC dimension in statistical learning theory and Popper's falsificationism. Popper's prohibition on unfalsifiable theories was grounded in the conviction that testability is necessary for science to progress towards closer approximations of the truth, or "verisimilitude" [1963]. Since unfalsifiable theories have infinite VC dimension, there is an obvious link between the first central result of statistical learning theory and one of the central tenets of Popper's philosophy of science. But the similarity between Popper and statistical learning theory does not end here. Popper emphasized that testability comes in degrees, and he thought that a preference for more testable theories would increase the rate of scientific progress. Moreover, the concept of VC dimension is similar though not equivalent to Popper's concept of theory dimension, which was one of his proposals about how to make the notion of degrees of testability more precise [1959, chapter 6], and both are promoted as means for making convergence more efficient.

But despite these similarities, there are some important differences in the technical details and, what is of greater interest here, in the fundamental aims of Popper's falsificationism and statistical learning theory. Popper's falsificationism is a staunchly scientific realist perspective driven by the goal of enhancing the efficiency of scientific progress towards truth [Popper, 1963]. In contrast, statistical learning theory aims to minimize the expected errors of predictions [Vapnik, 2000, 18]. On the face of it, these aims are very different. For example, the aim of minimizing expected predictive error can justify selecting a simple hypothesis even when it is known that the truth is more complex [Vapnik, 2000, 116]. The similarity of the concepts of VC and Popper dimension, therefore, raises some intriguing questions about the connection between predictive accuracy and efficient convergence to the truth. In the final section of this essay, I explain how an account of Ockham's razor from the perspective of formal learning theory provides some insights on this issue.
VC DIMENSION

Statistical learning theory aims to provide a general model of what Vapnik informally terms "learning from examples" or, more formally, a "function estimation model." This model consists of the following three elements [Vapnik, 2000, 17]:

1. A generator (G) of random vectors x ∈ ℝⁿ, drawn independently from a fixed but unknown probability function F(x).

2. A supervisor (S) who returns an output value y to every input vector x, according to a conditional distribution function F(y|x), also fixed but unknown.

3. A learning machine (LM) capable of implementing a set of functions f(x, α), α ∈ Λ, where Λ is a set of parameters.

In other words, the learning machine is fed a set of data points x and corresponding values of y generated from the unknown conditional distribution function F(y|x), and then selects a function from the set f(x, α) and uses it to make predictions about future data generated by F(y|x). For example, suppose the problem is to infer gender from heights and weights. In this case, the vector x consists of height-weight pairs, while y indicates gender, which for simplicity we can suppose must be either male or female. The "training" data then consist of a set of height-weight pairs together with the gender for some number of people. The learning machine attempts to select the function that will do the best job of predicting gender by height and weight in subsequent batches of data. This is a simple example of what is known as pattern recognition.

The learning machine typically restricts attention to functions that satisfy a certain functional form, for instance, linear functions. As a result, the conditional distribution function F(y|x) cannot be assumed to be one of the alternatives that the learning machine may choose. The aim, then, is not to discover F(y|x) but instead to select the function f(x, α) that minimizes expected predictive error. The expected predictive error of a function f(x, α), denoted by R(α), is defined as follows.

$$R(\alpha) = \int L(y, f(x, \alpha))\, dF(x, y)$$
The quantity L(y, f(x, α)) is called the loss function. In pattern recognition, the loss function equals 0 if y = f(x, α) and 1 otherwise. In other types of inference problem, the loss function is defined differently; for example, in regression estimation it is (y − f(x, α))². The aim, then, is to select the function f(x, α) that has the smallest value of R(α). However, since F(x) and F(y|x) (and hence F(x, y)) are unknown, it is not possible to find the function that minimizes predictive error through a direct application of the above equation. Instead, the training data are the basis for deciding which function f(x, α) to select. The most straightforward way to do this is to choose the function that minimizes error on the training data. The error on the training data for a function f(x, α) is denoted by Remp(α) and defined as follows.

$$R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha))$$
In this equation, n is the number of observations in the training data. Thus, in pattern recognition, the function that minimizes Remp(α) is simply the one that makes the fewest mistakes in the training data. The expression empirical risk minimization (ERM) is used to refer to selecting a function f(x, α) that minimizes error on the training data as just explained. Statistical learning theory, then, focuses on the following three questions [Vapnik, 2000, 21]:

1. Under what conditions does ERM ensure that Remp(α) converges to R(α) in the long run (i.e. when is ERM "consistent")?

2. What is the rate of the convergence and how can it be controlled?

3. How can algorithms be constructed to control the rate of convergence?
The concept of VC dimension is connected to the answers to these questions, especially question number 2. VC dimension is defined in terms of the concept of shattering. The best way to approach these rather intricate concepts is by means of a simple example. Consider again our example of predicting gender on the basis of height and weight. In this example, height-weight pairs can be represented by points on a plane that are labeled + for female or − for male. Suppose we restrict attention to functions that group data points into +'s and −'s by drawing straight lines between them, which we can call the linear functions. Now suppose that we are given three pairs of heights and weights, represented as three points on the plane, but we are not yet told the genders of the people whose heights and weights we have been given (see figure 1).

Figure 1. Three points representing height-weight pairs for three individuals

In this situation, shattering a set of data points means being able to draw a straight line between them so that all the +'s are on one side of the line and all the −'s are on the other, no matter which points are labeled + and which −. In other words, for each possible way of assigning the +'s and −'s, there is a straight line that separates them. Thus the intuitive origin of the term "shatter": if the set of functions shatters the set of data points, then one is assured of being able to "break down" the data points according to the values of the variable one wishes to predict. It is easy to see that the set of three data points in figure 1 is shattered by the linear functions (see figure 2). However, not every configuration of three data points is shattered by the linear functions; for example, they do not shatter a set of three perfectly collinear points (see figure 3). In addition, no set of four data points is shattered by the linear functions. To see this, consider a set of four data points. If the data points are perfectly collinear, then they are not shattered by the set of straight lines for the same reason that three perfectly collinear points are not. Suppose then that the four points are not collinear. But in that case the linear functions do not shatter them either (see figure 4). Thus, in this example, the linear functions shatter some sets of three data points, but shatter no sets of four or more data points. The concept of shattering is used to define VC dimension. A set of functions has a VC dimension h if and only if the set of functions shatters some group of h many data points and no group of more than h data points. Thus, in the height-weight-gender example the VC dimension of the linear functions is three.
Figure 2. The set of linear functions shatters these three data points

Figure 3. The linear functions do not shatter three perfectly collinear data points
Figure 4. The linear functions do not shatter any set of four data points

Given the above example, the general definitions of shattering and VC dimension can be stated more intelligibly. Consider a set of functions f(x, α), α ∈ Λ, and a set of data points x1, ..., xn. This set of functions is said to shatter x1, ..., xn if, for any possible assignment of values to y1, ..., yn, there is an α ∈ Λ such that f(x, α) makes no errors on this set of data. In the example above, f(x, α) is the set of linear functions, so the parameters α would include the slope and intercept. The data points (the xi's) are the height-weight pairs, and the yi's are the genders represented by the +'s and −'s. The VC dimension h of a set of functions f(x, α), α ∈ Λ, then, is the maximum number such that some set of h data points can be shattered by f(x, α), α ∈ Λ.¹ Notice that the definitions of shattering and VC dimension do not assume that f(x, α) is the set of linear functions — it could be any set of functions. Nor do they assume that data points consist of pairs of measurements (e.g. of height and weight). Each data point might consist of one measurement or three or however many one likes.

¹ This is actually the definition of VC dimension for indicator functions, in which y must be either 1 or 0. See Vapnik [2000, 80] for the generalization of this definition to real functions.

The concept of VC dimension is connected to questions 1 through 3 above by way of some central mathematical results of statistical learning theory. The first is that ERM ensures that Remp(α) converges to R(α) only when the VC dimension of the set of functions is finite [Vapnik, 2000, 80; Harman and Kulkarni, 2007, 48-49]. In other words, finite VC dimension is a necessary condition for being assured of converging to the most predictively accurate function in the long run. As Vapnik explains, finite VC dimension is closely related to Popper's concept of a falsifiable hypothesis [Vapnik, 2000, 47-55]. That point will be discussed in further detail below.

The second connection has to do with controlling the rate of convergence. Although it is good to know that one's methods will hone in on the right answer in the long run, it is also very important, and perhaps even more important, to think about how the method behaves in the meantime. This issue is linked to the classic problem of how to balance maximizing fit with the data against overfitting [Forster and Sober, 1994]. When fitting a function to a set of data points, it is better, other things being equal, to prefer functions that classify those points more accurately. In pattern recognition, for instance, ERM would direct one to select the function that makes the fewest errors. But fit with the data is not the only relevant consideration. Because there is randomness in the data, some of the observed variation does not reflect the underlying relationship that one wants to approximate. Thus, selecting
Thus, selecting a function that perfectly fits all variation in the data is not necessarily a good idea, since some of that variation may not indicate anything about what to expect in the future. It is a commonplace, then, that curve fitting ought to involve some balance between good fit with the data and some procedure for avoiding overfitting.

In statistical learning theory, VC dimension is central to reducing the chance of overfitting the data. The higher the VC dimension of a set of functions, the greater its capacity to fit the data. For example, while the linear functions do not shatter the four data points in figure 4, those data points would be shattered by sets of higher order polynomial functions. To make this point more concrete, consider functions that divide the data points into +'s and −'s by drawing parabolas between them, which we can call the parabolic functions. The VC dimension of the parabolic functions in the height-weight-gender example is four. Thus, selecting the function from a set with higher VC dimension increases the chances of having a better fit with the data, but it also increases the chance of overfitting. The challenge, then, is to decide how to balance these two conflicting concerns. In statistical learning theory, this challenge is addressed by deriving a probabilistic upper limit on the expected predictive error given the error in the training data and VC dimension [Vapnik, 2000, chapter 4]. More specifically, with probability 1 − η,

R(α) ≤ Remp(α) + √( (h(ln(2n/h) + 1) − ln(η/4)) / n ).2

Recall that R(α) indicates the expected predictive error of the function f(x, α), Remp(α) represents the error of that function on the training data, h is the VC dimension of the set from which functions are selected, and finally n is the number of data points in the training data. To see how this result would be applied, consider again the example of predicting gender on the basis of height and weight. In that example, the VC dimension h of the set of linear functions is three. If we chose instead to select a function from the parabolic functions, h would equal four. So, the question is which set of functions to select our function from, for example, just the linear functions or a set including both linear and parabolic functions. Choosing a set with lower VC dimension will result in a smaller value for √( (h(ln(2n/h) + 1) − ln(η/4)) / n ), but sets with lower VC dimension will also typically have a higher training error Remp(α). Thus, the above inequality allows one to choose the set of functions that yields the lowest probabilistic upper bound on the expected predictive error, R(α).

2 Like other results from statistical learning theory, this assumes that past and future data are generated by the same independent and identically distributed probability distribution.
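The inequality can be applied directly. In the sketch below (my illustration; the training errors 0.12 and 0.09 and the sample size n = 200 are invented stand-ins for the height-weight example), the set with higher VC dimension fits the training data better but pays a larger penalty term:

```python
import math

def vc_bound(emp_risk, h, n, eta=0.05):
    """With probability 1 - eta,
    R(alpha) <= Remp(alpha) + sqrt((h*(ln(2n/h) + 1) - ln(eta/4)) / n),
    transcribing the inequality displayed above."""
    penalty = math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)
    return emp_risk + penalty

n = 200
print(vc_bound(0.12, h=3, n=n))  # linear functions: lower h, higher Remp
print(vc_bound(0.09, h=4, n=n))  # parabolic functions: higher h, lower Remp
# One then selects the set whose bound on R(alpha) is lowest.
```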
Testability

In The Logic of Scientific Discovery, Popper characterized a testable or falsifiable theory as one that is capable of being "refuted by experience" [1959, 18]. As is well known, Popper insisted that only falsifiable theories were genuinely scientific, but not because they were more likely to be true. To the contrary, Popper emphasized that the more falsifiable a theory is, the more it sticks its neck out,
and hence the more improbable it is [1963, 217-220]. Popper took this as proof that science does not aim primarily for probable theories, but instead for highly informative ones. The reason for preferring falsifiable theories, according to Popper, is that by so doing we further scientific progress. This point comes out in The Logic of Scientific Discovery in connection with what Popper terms "conventionalist stratagems" [1959, 57-61]. Conventionalism treats scientific theories as true by definition, so that if some apparent conflict arises between the theory and observation, that conflict must be resolved by rejecting something other than the theory. Popper admitted that there is no logical contradiction to be found in conventionalism, but he argued that it was nevertheless highly problematic on methodological grounds. In particular, conventionalism would obstruct the advancement of scientific knowledge, and we should therefore firmly commit to rules of scientific method that disallow conventionalist stratagems [1959, 61-62]. This connection between falsifiability and scientific progress is easy to appreciate in light of some of Popper's favorite examples of theories that failed to abide by his strictures, for instance, Marxism and Freudian psychology. According to Popper, the Marxist and Freudian traditions were case studies of how treating scientific theories as unquestionable truths could lead researchers into a morass of ad hoc explanations that stultified the advancement of knowledge.

Popper set out his view of scientific progress in greater detail in Conjectures and Refutations [1963, 231-248]. The central theme of that proposal is quite simple. Scientific progress in Popper's sense occurs when a scientific theory is refuted and replaced with another that is closer to the truth. As we do not have direct access to the truth, progress would typically be judged by some more indirect means. For example, suppose that one theory T is refuted and replaced by another T∗ such that (1) T∗ passes all of the severe tests that T passed, (2) T∗ passes the tests that T failed, and (3) T∗ makes new predictions that turn out to be correct. If this happens, then Popper thought that we have good reason to say that T∗ is closer to the truth than T. For example, Popper thought that Einstein's General Theory of Relativity satisfied these conditions with respect to Newtonian Mechanics.

It is obvious that falsifiable theories are a necessary ingredient in this picture of scientific progress. On Popper's account, the advancement of science is driven by refuting theories and replacing them with better theories that generate new discoveries. Therefore, unfalsifiable theories — or theories we decide to save at all costs by means of "conventionalist stratagems" — stop progress in its tracks. Moreover, Popper believed that this process of conjectures and refutations would, in the long run, lead scientists closer and closer to the truth, although we may never know at any given time how close we are (or aren't). Popper claimed, then, that testable theories are necessary if science hopes to converge to the truth in the long run. It is easy to see an analogy between Popper's claim on this score and the result from statistical learning theory that finite VC dimension is a necessary condition for long run convergence to the function that minimizes expected predictive error. A set of functions Φ would be unfalsifiable if, for any possible set of data, there is a function in Φ that can fit that data with zero error.
Recall that the VC dimension h of a set of functions is the maximum number such that some set of h data points can be shattered by that set. An unfalsifiable set of functions, then, would have no such maximum and hence would have infinite VC dimension. Thus, a basic result of statistical learning theory coincides with Popper's intuition that falsifiability is a necessary ingredient for being assured of homing in on the truth in the long run. Vapnik concludes his discussion of the relationship between falsifiability and statistical learning theory by remarking "how amazing Popper's idea was" [2000, 55].

Popper also proposed that the falsifiability or testability of theories could come in degrees. Degrees of testability are clearly important for Popper's vision of scientific progress. For when one theory is refuted, there may be several possible replacements, and Popper would presumably recommend that we choose the most testable of the viable alternatives. Moreover, Popper's reasoning naturally suggests that going with the most testable theory would accelerate scientific progress. After all, a barely testable theory might not halt progress altogether, but it could certainly slow it down. Indeed, the idea that degrees of testability are linked to the rate of scientific progress is hinted at in the epigraphs to Conjectures and Refutations.

Experience is the name everyone gives to their mistakes.
(Oscar Wilde)

Our whole problem is to make the mistakes as fast as possible. . .
(John Archibald Wheeler)

More testable theories rule out more possibilities and typically will be refuted faster than less testable ones. So, it is easy to guess Popper's meaning here: the more testable our theories, the faster our mistakes, and the more rapid the advancement of science.

In The Logic of Scientific Discovery, Popper suggested two grounds for comparing degrees of testability [1959, chapter 6]. The first was a subclass relation. For instance, the theory that planets move in circles around the sun is a subclass of the theory that they move in ellipses, and hence the former theory is more easily refuted (i.e. more testable) than the latter. However, it is the second of Popper's suggestions for how to compare degrees of testability that is most pertinent to our concerns here. Popper proposed that the dimension of a theory be understood in terms of the number of data points needed to refute it. More specifically, if d + 1 is the minimum number of data points needed to refute the theory t, then the Popper dimension of t is d [1959, 113-114].

The difference between Popper and VC dimension can be neatly stated in terms of shattering. Suppose we think of theories as sets of functions. If the Popper dimension of a theory of functions is d, then no set of only d data points can refute the theory and hence the theory shatters every group of d many data points. On the other hand, if the VC dimension of the theory is h, then that set shatters some but not necessarily all groups of h many data points. This difference is illustrated by the example of predicting gender by height and weight discussed above. In that example, the linear functions shatter every set of two data points, some but not all sets of three data points, and no sets of four data points.
Consequently, the Popper dimension of the linear functions in this case is two, while the VC dimension is three. Further divergences between Popper and VC dimension occur in cases in which data points consist of more than two measurements. For example, suppose we wanted to predict diabetes on the basis of blood pressure, body mass index, and cholesterol level. In this case, the data points would be spread out in a three dimensional space, and the linear functions would separate data points with flat planes. In this situation, the Popper dimension of the linear functions remains two (since a flat plane cannot separate three perfectly collinear points in a three dimensional space) but the VC dimension of the linear functions in this case would be four.3

However, there are some cases in which Popper and VC dimension coincide. For instance, consider a very simple example in which one wishes to predict the colors of balls drawn from an urn, which may be either blue or red.4 In this example, the xi's consist of balls drawn from the urn and the yi's of an indication of the color (blue or red) of each ball. Suppose that 99 balls have been drawn so far, and all are red. The functions, then, tell us what to predict about the colors of the future balls given this data. What we might call the inductive function directs us to predict that all future balls will be red. Another set of functions directs us to predict that the balls will switch at some future time from red to blue and stay blue from then on. We can call these the anti-inductive functions. Notice that in this example a fixed number of data points cannot be arranged into distinct configurations as in figures 2 and 3, and hence there is no difference between shattering some and shattering all configurations of n many data points. As a result, Popper and VC dimension are equivalent in this case. For example, the inductive function does not shatter the next data point, since it would be refuted if the next ball is blue. Hence, its Popper and VC dimension is zero. In contrast, the anti-inductive functions can shatter the next data point, since the switch from red to blue might begin with the next ball or it might begin later. However, the anti-inductive functions do not shatter the next two data points, since none of them can accommodate a switch to blue followed by an immediate switch back to red. Thus, the Popper and VC dimension of the anti-inductive functions is one.

Let us sum up the similarities and differences between VC dimension and Popper's notion of degrees of testability, beginning with the similarities. The two concepts are similar in spirit, coincide in some simple examples, and track one another in some other examples. In addition, similar claims are made on behalf of both: testability and finite VC dimension are claimed to be necessary for convergence in the long run, and a preference for lower Popper dimension (i.e. greater testability) and lower VC dimension are both said to promote faster convergence. Moreover, there is a further similarity between Popper and VC dimension that is rarely remarked upon: both of these concepts presuppose without explanation some natural or preferred way of dividing data up into points or units. Since different modes of expression might result in different ways of carving up data into units, this means that neither concept is language invariant.5

3 To see this, arrange the points in the following way. Put three of the four points on the same plane but not collinear, and place the fourth point on a different plane from the other three: then you can separate any assignment of +'s and −'s with a flat plane.
4 This example is borrowed from Goodman [1946].
5 Popper seems to have recognized this problem [1959, 283-284]. For a suggestion about how to eliminate language variance from the concept of VC dimension, see [Steel, 2009].
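Before turning to the differences, the shattering claims in the urn example are small enough to verify mechanically. The sketch below is mine, not Steel's; since the only data here are the colors of successive draws and no rearrangement is possible, checking whether every color pattern on the next m draws is realized settles Popper and VC dimension at once.

```python
def patterns(functions, m):
    """The set of color patterns a function class can produce on the
    next m draws."""
    return {tuple(f(t) for t in range(m)) for f in functions}

def shatters_next(functions, m):
    """True if every pattern of 'red'/'blue' on the next m draws occurs."""
    return len(patterns(functions, m)) == 2 ** m

inductive = [lambda t: "red"]  # all future balls are red
# Anti-inductive: red up to some switch time k, permanently blue after.
anti_inductive = [lambda t, k=k: "red" if t < k else "blue" for k in range(5)]

print(shatters_next(inductive, 1))       # False -> dimension 0
print(shatters_next(anti_inductive, 1))  # True: switch now or switch later
print(shatters_next(anti_inductive, 2))  # False: no blue-then-red pattern
```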
Now let us turn to the differences between Popper and VC dimension. First, there are differences in technical details of the two concepts, as explained above. Furthermore, Popper never provided any precise articulation or proof of his claims about the link between testability and convergence, while Vapnik and others have done this for VC dimension. Thus, statistical learning theory represents a very significant step forward from Popper's work. Finally, there are also some important differences in philosophical motivation that I will discuss in the next section.
Scientific Realism, Empirical Risk Minimization, and Simplicity

The conceptual connection between falsificationism and statistical learning theory is intriguing and suggestive. In one sense, the similarity is understandable given that both are proposed as a means for promoting faster convergence to a correct answer. In another sense, however, the correspondence is surprising because different things are meant by "correct answer" in the two cases. For Popper, convergence meant convergence to the truth. In statistical learning theory, convergence means convergence to the function that has the least expected predictive error in comparison to the other functions in a given set. Moreover, in Popper's scheme a refuted theory should be replaced by another that might be true given what we know, while in statistical learning theory it may be appropriate to select a function from a set that we know does not contain the true one. For example, we might minimize expected predictive error by choosing a linear function even if it is known that the true conditional distribution function is not linear [Kelly and Mayo-Wilson, 2008; Vapnik, 2000, 116]. In short, scientific realism and minimizing predictive risk are two very different motivations. The question posed by the similarity between Popper's philosophy of science and statistical learning theory is whether there is some important connection between these two different aims. In this section, I consider how insights on this topic follow from a three way comparison between Popper, statistical learning theory, and a recent account of empirical simplicity from the literature on formal learning theory [Kelly, 2004; 2007a; 2007b].

Kevin Kelly proposes a general account of Ockham's razor — the claim that simpler theories should be preferred to more complex ones — in terms of efficient convergence to the truth. Efficiency in Kelly's account is understood primarily in terms of minimizing the maximum number of retractions of conjecture, or "mind changes." The central idea is that a reliable learner who conjectures a more complex theory before a simpler one can be forced by the data to revert to the simpler theory and then to return again to the more complex one. In contrast, a learner who goes with the simpler theory first cannot be forced to loop around in this way, but can only be forced by the data to move directly from simpler to more complex.
For instance, consider the example of drawing red or blue balls from an urn discussed above. According to Kelly's approach, Ockham's razor would require selecting the hypothesis that all the balls are red in this example. For suppose that we chose instead to conjecture the anti-inductivist alternative that the balls start out red and then permanently switch to blue at some future point. Then if the learner is reliable, she will eventually have to revert to conjecturing that all the balls are red given a sufficiently large number of further observations of red and no blue balls. (If she did not, she could continue conjecturing the false hypothesis forever and hence not be reliable.) Yet once she has reverted to conjecturing that all the balls are red, subsequent observations of blue balls can force her to change her mind again. In contrast, if the learner begins with "all of the balls are red," then she cannot be forced by the data to loop around in this way.6

There are several conceptual links between Kelly's account of Ockham's razor and Popper's falsificationism. Like Popper's, Kelly's approach is motivated by the aim of efficient convergence to truth. Moreover, in the ball drawing example, the Popper dimension of the hypothesis that all the balls are red is zero, while the Popper dimension of the hypothesis that they are all red until some future point and then permanently switch to blue is one. Thus, in this example, Kelly's account of Ockham's razor and Popper's claim that scientists should conjecture the most testable viable alternative yield the same recommendation. In addition, Popper and VC dimension coincide in the ball drawing example. And in fact, this example is just one instance of a more general class of cases in which Ockham's razor as construed by Kelly, Popper dimension, and VC dimension all lead to the same result [Steel, 2009].7

Given the connections between Kelly's account of Ockham's razor, Popper's concept of testability, and VC dimension described here, one might wonder whether objections to Popper's falsificationism would also be objections to either of these other two approaches. I will consider this question only in relation to the topic of simplicity. As is well known, Popper proposed that degrees of simplicity be equated with degrees of testability [1959, 126]. According to Popper, this approach had the advantage not only of providing a clearer definition of the concept of simplicity but also of leading to a principled explanation of why simplicity was important for science. One of Popper's arguments in favor of this proposal was that his concept of theory dimension corresponds to geometrical dimension [1959, 110-115, 129] and that geometrical dimension is often associated with simplicity. Geometrical dimension has to do with the number of free parameters needed to define a geometrical class (e.g. the straight lines or the parabolas). For instance, in the height-weight-gender example, the Popper dimension of the parabolic functions is three while the Popper dimension of the linear functions is two, and the equation used to define the set of parabolas (i.e. y = a + bx + cx²) has three free parameters while the equation used to define the straight lines (i.e. y = a + bx) has two free parameters.

6 This analysis of Goodman's new riddle of induction was first published in an essay by Schulte [1999].
7 However, Kelly's analysis of Ockham's razor differs from both Popper and VC dimension on the issue of language variance. Kelly develops a language invariant definition of empirical complexity in terms of the number of further mind changes that the data could force a reliable learner to make in an inductive problem [2007b].
Moreover, the linear functions are naturally thought of as simpler than the parabolic functions. However, subsequent authors have conclusively shown that Popper was mistaken in supposing that theory dimension corresponds to geometric dimension in general; indeed this correspondence does not even hold for some of the examples that Popper gave to illustrate it [Turney, 1991]. As a result of such criticisms, Popper's attempt to explain simplicity by way of testability is usually regarded as a failure.

Both statistical learning theory and Kelly's learning theoretic approach to Ockham's razor avoid the above difficulty, but in different ways. Statistical learning theory does not claim that lower VC dimension always corresponds to greater simplicity and hence does not commit itself to explaining simplicity in terms of VC dimension [Harman and Kulkarni, 2007, 69-73]. Thus, from the perspective of statistical learning theory, Popper was mistaken to think that testability is connected to simplicity in some systematic and general way. According to Kelly's approach, by contrast, Popper had the right intuition that efficient convergence to the truth is the central justification for a scientific preference for simpler theories. Popper's mistake, from Kelly's perspective, was to suppose that the concept of complexity relevant to Ockham's razor — which Kelly refers to as empirical complexity — corresponds to geometrical dimension.

As the survey given here shows, statistical learning theory offers many new insights regarding the concept of testability and such issues as predictive accuracy, efficient convergence to truth, and simplicity. It is, therefore, worthy of further sustained interest from philosophers of science.

ACKNOWLEDGMENTS

I would like to thank Malcolm Forster and an anonymous referee for helpful comments and suggestions on an earlier draft of this essay.

BIBLIOGRAPHY

[Corfield et al., 2005] D. Corfield, B. Schölkopf, and V. Vapnik. Popper, Falsification and the VC-Dimension. Max Planck Institute of Biological Cybernetics, Technical Report No. 145, 2005.
[Forster and Sober, 1994] M. R. Forster and E. Sober. How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions. British Journal for the Philosophy of Science 45: 1-35, 1994.
[Goodman, 1946] N. Goodman. A Query on Confirmation. Journal of Philosophy 43: 383-385, 1946.
[Harman and Kulkarni, 2007] G. Harman and S. Kulkarni. Reliable Reasoning: Induction and Statistical Learning Theory, MIT Press, Cambridge, MA, 2007.
[Kelly, 2004] K. Kelly. Justification as Truth-Finding Efficiency: How Ockham's Razor Works. Minds and Machines 14: 485-505, 2004.
[Kelly, 2007a] K. Kelly. A New Solution to the Puzzle of Simplicity. Philosophy of Science 74: 561-573, 2007.
[Kelly, 2007b] K. Kelly. Ockham's Razor, Empirical Complexity, and Truth-finding Efficiency. Theoretical Computer Science 383: 270-289, 2007.
[Kelly and Mayo-Wilson, 2008] K. Kelly and C. Mayo-Wilson. Reliable Reasoning: Induction and Statistical Learning Theory (Book Review). Notre Dame Philosophical Reviews, 2008. (http://ndpr.nd.edu/reviews.cfm).
[Popper, 1959] K. Popper. The Logic of Scientific Discovery, Routledge, New York, 1959.
[Popper, 1963] K. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, New York, 1963.
[Schulte, 1999] O. Schulte. Means-Ends Epistemology. British Journal for the Philosophy of Science 50: 1-31, 1999.
[Steel, 2009] D. Steel. Testability and Ockham's Razor: How Formal and Statistical Learning Theory Converge in the New Riddle of Induction. Journal of Philosophical Logic 38: 471-489, 2009.
[Turney, 1991] P. Turney. A Note on Popper's Equation of Simplicity with Falsifiability. The British Journal for the Philosophy of Science 42: 105-109, 1991.
[Vapnik, 2000] V. Vapnik. The Nature of Statistical Learning Theory, Springer, New York, 2000.
Part XI
Different Approaches to Simplicity Related to Inference and Truth
LUCKINESS AND REGRET IN MINIMUM DESCRIPTION LENGTH INFERENCE

Steven de Rooij and Peter D. Grünwald

1 INTRODUCTION
Suppose we have been observing some phenomenon for a while, and we now consider a number of alternative hypotheses to explain how the data came about. We want to use the data somehow to evaluate these hypotheses and decide which is the "best" one. If one of the hypotheses exactly describes the true mechanism that underlies the phenomenon, then that is the one we hope to find. While this may already be a hard problem, available hypotheses are often merely approximations in practice. In that case the goal is to select a hypothesis that is useful, in the sense that it provides insight into previous observations, and matches new observations well. Of course, we can immediately reject hypotheses that are inconsistent with new experimental data, but hypotheses often allow for some margin of error; as such they are never truly inconsistent, but they can vary in the degree of success with which they predict new observations. A quantitative criterion is required to decide between competing hypotheses.

The Minimum Description Length (MDL) principle is such a criterion [Rissanen, 1978; Rissanen, 1989]. It is based on the intuition that, on the basis of a useful theory, it should be possible to compress the observations, i.e. to describe the data in full using fewer symbols than we would need using a literal description. According to the MDL principle, the more we can compress a given set of data, the more we have learned about it. The MDL approach to inference requires that all hypotheses are formally specified in the form of codes. A code is a function that maps possible outcomes to binary sequences; thus the length of the encoded representation of the data can be expressed in bits. We can encode the data by first specifying the hypothesis to be used, and then specifying the data with the help of that hypothesis. Suppose that L(H) is a function that specifies how many bits we need to identify a hypothesis H in a countable set of considered hypotheses H. Furthermore let LH(D) denote how many bits we need to specify the data D using the code associated with hypothesis H. The MDL principle now tells us to select that hypothesis H for which the total description length of the data, i.e. the length of the description of the hypothesis L(H), plus the length of the description of data using that hypothesis LH(D), is shortest. The hypothesis yielding the minimum total description length as a function of the data is denoted Hmdl, and the resulting code length is Lmdl:
(1)  L_mdl(D) = L(H_mdl) + L_{H_mdl}(D) = min_{H ∈ H} ( L(H) + L_H(D) ).
Intuitively, the term L(H) represents the complexity of the hypothesis while LH(D) represents how well the hypothesis is able to describe the data, often referred to as the goodness of fit. By minimising the sum of these two components, MDL implements a tradeoff between complexity and goodness of fit. Also note that the selected hypothesis only depends on the lengths of the used code words, and not on the binary sequences that make up the code words themselves. This will make things a lot easier later on.

By its preference for short descriptions, MDL implements a heuristic that is widely used in science as well as in learning in general. This is the principle of parsimony, also often referred to as Occam's razor. A parsimonious learning strategy is sometimes adopted on the basis of a belief that the true state of nature is more likely to be simple than to be complex. We point out right at the outset that MDL is not based on such a belief. Instead, we try to avoid assumptions about the truth as much as possible, and focus on effective strategies for learning. As explained in Section 5.2, even if the truth is "complex", it is often a very effective strategy to prefer "simple" hypotheses for small sample sizes. That said, a number of issues still need to be addressed if we want to arrive at a practical theory for learning:

1. What code should be used to identify the hypotheses?
2. What codes for the data are suitable as formal representations of the hypotheses?
3. After answering the previous two questions, to what extent can the MDL criterion actually be shown to identify a "useful" hypothesis, and what do we actually mean by "useful"?

These questions, which lie at the heart of MDL research, are illustrated in the following example.

EXAMPLE 1 Language Learning. Ray Solomonoff, one of the founding fathers of the concept of Kolmogorov complexity (to be discussed in Section 2.4), introduced the problem of language inference based on only positive examples as an example application of his new theory [Solomonoff, 1964]; it also serves as a good example for MDL inference. We will model language using context-free grammars (introduced as "phrase structure grammars" by Chomsky in [Chomsky, 1956]). A grammar is a set of production rules, each of which maps a symbol from a set of nonterminal symbols N to a (possibly empty) replacement sequence consisting of both other nonterminals and terminal symbols, which are words from some dictionary Σ. A sentence is grammatical if it can be produced from the starting symbol, which is a particular nonterminal, by iteratively applying a production rule to one of the matching nonterminal symbols. The following grammar is an example. It uses the standard
Luckiness and Regret in Minimum Description Length Inference
867
abbreviation that multiple rules for the same nonterminal may be combined using a pipe symbol, i.e. two rules n → r1 and n → r2 are written n → r1 | r2 . The empty sequence is denoted ǫ.
(2)
    Sentence    → Nounphrase Verbphrase
    Nounphrase  → Determiner Adjectives Noun | Adjectives Noun
    Determiner  → the | a
    Adjectives  → Adjective Adjectives | ǫ
    Adjective   → big | complex | careful | curvy
    Noun        → men | statistician | model
    Verbphrase  → Verb Nounphrase
    Verb        → prefers | avoids
This grammar accepts such sentences as The careful statistician avoids the complex model, or The curvy model prefers big men. We proceed to infer such a grammar from a list of valid example sentences D. Assume that we already know the words of the language, i.e. Σ contains all words that occur in D and is background knowledge. The task is to use D to learn how the words may be ordered. For the set of our hypotheses H we take all context-free grammars that use only words from Σ. Note that H is a countably infinite set. We need to define the relevant code length functions: L(H) for the specification of the grammar H ∈ H, and LH(D) for the specification of the example sentences with the help of that grammar. In this example we use very simple code length functions; later in this introduction, after describing in more detail what properties good codes should have, we return to language inference with a more in-depth discussion.

We use uniform codes in the definitions of L(H) and LH(D). A uniform code on a finite set A assigns binary code words of equal length to all elements of the set. Since there are 2^l binary sequences of length l, a uniform code on A must have code words of length at least ⌈log |A|⌉. (Here and in the following, ⌈·⌉ denotes rounding up to the nearest integer, and log denotes binary logarithm.) To establish a baseline, we first use uniform codes to calculate how many bits we need to encode the data literally, without the help of any grammar. Namely, we can specify every word in D with a uniform code on Σ ∪ {⋄}, where ⋄ is a special symbol used to mark the end of a sentence. Two consecutive diamonds signal the end of the data. Let w denote the total number of words in D, and let s denote the number of sentences. With the uniform code we need (w + s + 1)⌈log |Σ ∪ {⋄}|⌉ bits to encode the data. We are looking for grammars H which allow us to compress the data beyond this baseline value.

We specify L(H) as follows. For each production rule of H, we use ⌈log |N ∪ {⋄}|⌉ bits to uniformly encode the initial nonterminal symbol, and ⌈log |N ∪ Σ ∪ {⋄}|⌉ bits for each symbol in the replacement sequence. The ⋄ symbol signals the end of each rule; an additional diamond marks the end of the entire grammar. If the grammar H has r rules, and the summed length of the replacement sequences is l, then we can calculate that the number of bits we need to encode the entire grammar is at most
(3)  (r + 1)⌈log |N ∪ {⋄}|⌉ + (l + r)⌈log |N ∪ Σ ∪ {⋄}|⌉.
The first term is the length needed to describe the initial nonterminal symbols of each rule; the +1 is because, to signal the end of the grammar, we need to put an extra diamond. The second term is the length needed to describe the replacement sequences; the +r is because, at the end of each rule, we put an extra diamond.

If a context-free grammar H is correct, i.e. all sentences in D are grammatical according to H, then its corresponding code LH can help to compress the data, because it does not need to reserve code words for any sentences that are ungrammatical according to H. In this example we simply encode all words in D in sequence, each time using a uniform code on the set of words that could occur next according to the grammar. Again, the set of possible words is augmented with a ⋄ symbol to mark the end of the data. For example, based on the grammar (2) above, the 1-sentence data sequence D = the model avoids men is encoded using ⌈log 10⌉ + ⌈log 7⌉ + ⌈log 2⌉ + ⌈log 9⌉ + ⌈log 10⌉ bits, because, by following the production rules, we see that the first word can be a determiner, an adjective, or a noun; since we also allow for a diamond, this gives 10 choices in total. Given the first word the, the second word can be an adjective or a noun, so that there are 7 choices (now a diamond is not allowed); and so on. The final log 10 is the number of bits needed to encode the final diamond.

Now we consider two very simple correct grammars, both of which only need one nonterminal symbol S (which is also the starting symbol). The "promiscuous" grammar (terminology due to Solomonoff [Solomonoff, 1964]) has rules S → S S and S → σ for each word σ ∈ Σ. This grammar generates any sequence of words as a valid sentence. It is very short: we have r = 1 + |Σ| and l = 2 + |Σ|, so the number of bits L(H1) required to encode the grammar essentially depends only on the size of the dictionary and not on the amount of available data D. On the other hand, according to H1 all words in the dictionary are allowed in all positions, so LH1(D) requires as much as ⌈log(|Σ ∪ {⋄}|)⌉ bits for every word in D, which is equal to the baseline. Thus this grammar does not enable us to compress the data.

Second, we consider the "ad hoc" grammar H2. This grammar consists of a production rule S → d for each sentence d ∈ D. Thus according to H2, a sentence is only grammatical if it matches one of the examples in D exactly. Since this severely restricts the number of possible words that can appear at any given position in a sentence given the previous words, this grammar allows for very efficient representation of the data: LH2(D) is small. However, in this case L(H2) is at least as large as the baseline, since in this case the data D appear literally in H2!

Both grammars are clearly useless: the first does not describe any structure in the data at all and is said to underfit the data. In the second grammar random features of the data (in this case, the selection of valid sentences that happen to be in D) are treated as structural information; this grammar is said to overfit the data. Consequently, neither grammar enables us to compress the data at all, since the total code length L(H) + LH(D) exceeds the baseline. In contrast, by selecting a grammar Hmdl that allows for the greatest total compression as
Luckiness and Regret in Minimum Description Length Inference
869
per (1), we avoid either extreme, thus implementing a natural tradeoff between underfitting and overfitting. Note that finding this grammar Hmdl may be quite a difficult search problem, but the computational aspects of finding the best hypothesis in large hypothesis spaces are not considered in this text. ♦
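As a sanity check on the quantities in this example, the sketch below (mine, not the authors') evaluates the grammar code length bound (3) and the literal baseline. The counts for grammar (2), r = 17 rules and l = 22 replacement symbols (treating each alternative after a pipe as a separate rule), and the toy corpus size are my own bookkeeping assumptions.

```python
import math

def grammar_bits(r, l, nonterminals, terminals):
    """Upper bound (3) on the bits needed to encode a grammar with r rules
    and replacement sequences of summed length l, using uniform codes."""
    init = math.ceil(math.log2(nonterminals + 1))             # N plus diamond
    sym = math.ceil(math.log2(nonterminals + terminals + 1))  # N, Sigma, diamond
    return (r + 1) * init + (l + r) * sym

def baseline_bits(words, sentences, terminals):
    """Literal encoding: a uniform code on the dictionary plus diamond,
    once per word, per sentence end, and for the final extra diamond."""
    return (words + sentences + 1) * math.ceil(math.log2(terminals + 1))

# Grammar (2): 8 nonterminals and an 11-word dictionary.
print(grammar_bits(r=17, l=22, nonterminals=8, terminals=11))  # 267 bits
# A hypothetical 100-sentence, 600-word corpus, encoded literally:
print(baseline_bits(words=600, sentences=100, terminals=11))   # 2804 bits
```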
1.1 Overview
The three fundamental questions of page 866 will be treated in more detail in the following text. First we discuss codes for the set of hypotheses in terms of the concepts of luckiness and regret in Section 2. Although most previous treatments put less emphasis on luckiness (or fail to mention it altogether), it is our position that both concepts are fundamental to the MDL philosophy. Then in Section 3 we discuss how the data should be encoded based on a hypothesis. This requires that we outline the close relationship between codes and probability distributions, linking information theory to statistics. We then focus on the special case that the hypothesis is formulated in terms of a model, which is a set of codes or probability distributions. We describe different methods of constructing universal codes to represent such models, which are designed to achieve low regret in the worst case. After describing the process of designing codes, in Section 4 we will try to convince you that it may be useful to try and compress your data, even if you are really interested in something different, such as truth finding or prediction of future events, which, arguably, are the primary goals of statistics. Finally in Section 5 we place Minimum Description Length in the context of other interpretations of learning.
2 ENCODING THE HYPOTHESIS: A MATTER OF LUCKINESS AND REGRET
The intuition behind the MDL principle is that “useful” hypotheses should help compress the data. For now, we will simply define “useful” to be the same as “useful to compress the data”. We postpone further discussion of what this means to Section 4. What remains is the task to find out which hypotheses are useful, by using them to compress the data. Given many hypotheses, we could just test them one at a time on the available data until we find one that happens to allow for substantial compression. However, if we were to adopt such a methodology in practice, results would vary from reasonable to extremely bad. The reason is that among so many hypotheses, there could well be one that, by sheer force of luck, allows for significant compression of the data, even though it would probably do a very bad job of predicting future outcomes. This is the phenomenon of overfitting again, which we mentioned in Example 1. To avoid such pitfalls, we required that a single code Lmdl is proposed on the basis of all available hypotheses. The code has two parts: the first part, with length function L(H), identifies a hypothesis to be used to encode the data,
while the second part, with length function LH (D), describes the data using the code associated with that hypothesis. Intuitively, the smaller LH (D), the better H fits data D. For example, if H is a set of probability distributions, then the higher the likelihood that H achieves on D, the smaller LH . Section 3 is concerned with the precise definition of LH (D). Until then, we will assume that we have already represented our hypotheses in the form of codes, and we will discuss some of the properties a good choice for L(H) should have. Technically, throughout this text we use only length functions that correspond to prefix codes; we will explain what this means in Section 3.1.
2.1 Avoid Regret
Consider the case where the best candidate hypothesis, i.e. the hypothesis Ĥ ∈ H that minimises L_H(D), achieves substantial compression. It would be a pity if we did not discover the usefulness of Ĥ because we chose a code word with unnecessarily long length L(Ĥ). The regret we incur on the data quantifies how bad this "detection overhead" can be, by comparing the total code length L_mdl(D) to the code length achieved by the best hypothesis L_Ĥ(D). A general definition, which also applies if Ĥ is undefined, is the following: the regret of a code L on data D with respect to a set of alternative codes M is1

(4)  R(L, M, D) := L(D) − inf_{L′ ∈ M} L′(D).

1 In this text inf and sup are used for infimum and supremum, generalisations of minimum and maximum respectively.
The reasoning is now that, since we do not want to make a priori assumptions as to the process that generates the data, the code for the hypotheses L(H) must be chosen such that the regret R(L_mdl, {L_H | H ∈ H}, D) is small, whatever data we observe. This ensures that whenever H contains a useful hypothesis that allows for substantial compression of the data (more than this small regret), we detect this because L_mdl compresses the data as well.

EXAMPLE 2. Suppose that H is finite. Let L be a uniform code that maps every hypothesis to a binary sequence of length l = ⌈log2 |H|⌉. The regret incurred by this uniform code is always exactly l, whatever data we observe. This is the best possible worst-case guarantee: all other length functions L′ on H incur a strictly larger regret for at least one possible outcome (unless H contains useless hypotheses H which have L(H) + L_H(D) > L_mdl(D) for all possible D). In other words, the uniform code minimises the worst-case regret max_D R(L, M, D) among all code length functions L. (We discuss the exact conditions we impose on code length functions in Section 3.1.) Thus, if we consider a finite number of hypotheses we can use MDL with a uniform code L(H). Since in this case the L(H) term is the same for all hypotheses, it cancels and we find H_mdl = Ĥ in this case. What we have gained from this possibly anticlimactic analysis is the following sanity check: since we equated
learning with compression, we should not trust Ĥ, the hypothesis that best fits the data, to exhibit good performance on future data unless we were able to compress the data using H_mdl. ♦
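The selection rule (1) and the regret (4) are easy to state in code. The sketch below is a generic illustration of mine; the three hypothesis names and their data code lengths are placeholders, and the uniform choice L(H) = log2 |H| reproduces the constant regret of Example 2.

```python
import math

def mdl_select(hypotheses, data):
    """Two-part MDL: `hypotheses` maps a name to (L(H), L_H), where L_H is
    a function giving the data code length in bits. Returns the minimiser
    of L(H) + L_H(D) together with L_mdl(D), as in (1)."""
    total = {h: Lh + code(data) for h, (Lh, code) in hypotheses.items()}
    best = min(total, key=total.get)
    return best, total[best]

def regret(total_len, hypotheses, data):
    """Definition (4): L_mdl(D) minus the best data code length in the set."""
    return total_len - min(code(data) for _, code in hypotheses.values())

# Three placeholder hypotheses under a uniform code: L(H) = log2(3) each.
H = {
    "H1": (math.log2(3), lambda D: 100.0),
    "H2": (math.log2(3), lambda D: 80.0),
    "H3": (math.log2(3), lambda D: 95.0),
}
best, L_mdl = mdl_select(H, data=None)
print(best, regret(L_mdl, H, None))  # H2, regret = log2(3), about 1.58 bits
```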
2.2 Try to Get Lucky
There are many codes L on H that guarantee small regret, so the next task is to decide which we should pick. As it turns out, given any particular code L, it is possible to select a special subset of hypotheses H′ ⊂ H and modify the code such that the code lengths for these special hypotheses are especially small, at the cost of increasing the code lengths for the other hypotheses only slightly. This can be desirable, because if a hypothesis in H′ turns out to be useful, then we are lucky, and we can identify that hypothesis more readily to achieve superior compression. On the other hand, if all hypotheses in the special subset are poor, then we have not lost much. Thus, while a suitable code must always have small regret, there is quite a lot of freedom to favour such small subsets of special hypotheses. Examples of this so-called luckiness principle are found throughout the MDL literature, although they usually remain implicit, possibly because it makes MDL inference appear subjective. Only recently has the luckiness principle been identified as an important part of code design. The concept of luckiness is introduced to the MDL literature in [Grünwald, 2007]; [Barron, 1998] already uses a similar concept but not under the same name. We take the stance that the luckiness principle introduces only a mild form of subjectivity, because it cannot substantially harm inference performance. Paradoxically, it can only really be harmful not to apply the luckiness principle, because that could cause us to miss out on some good opportunities for learning!

EXAMPLE 3. To illustrate the luckiness principle, we return to the grammar learning of Example 1. We will not modify the code for the data LH(D) defined there; we will only reconsider L(H), which intuitively seemed a reasonable code for the specification of grammars. Note that L(H) is in fact a luckiness code: it assigns significantly shorter code lengths to shorter grammars. What would happen if we tried to avoid this "subjective" property by optimising the worst-case regret without considering luckiness? To keep things simple, we reduce the hypothesis space to finite size by considering only context-free grammars with at most |N| = 20 nonterminals, |Σ| = 490 terminals, R = 100 rules and replacement sequences summing to a total length of L = 2000. Using (3) we can calculate the luckiness code length for such a grammar as at most 101⌈log(21)⌉ + 2100⌈log(511)⌉ = 19405 bits. However, Example 2 shows that for finite H, the worst-case regret is minimised for the uniform code. To calculate the uniform code length, we first count the number of possible grammars H′ ⊂ H within the given size constraints. One may
or may not want to verify that

|H′| = Σ_{r=1}^{R} |N|^r Σ_{l=0}^{L} |N ∪ Σ|^l (l+r−1 choose l).
Calculation on the computer reveals that ⌈log |H′|⌉ = 18992 bits. Thus, we could compress the data 413 bits better than the code that we used before. This shows that there is room for improvement of the luckiness code, which may incur a larger regret than necessary. On the other hand, with the uniform code we always need 18992 bits to encode the grammar, even when the grammar is very short! Suppose that the best grammar uses only r = 10 and l = 100; then the luckiness code requires only 11⌈log(21)⌉ + 110⌈log(511)⌉ = 1045 bits to describe that grammar and therefore out-compresses the uniform code by 17947 bits: this grammar is identified much more easily using the luckiness code. A simple way to combine the advantages of both codes is to define a third code that uses one additional bit to specify whether the luckiness code or the uniform code is used. The regret of this third code is at most one bit more than minimal in the worst case, while many simple grammars can be encoded extra efficiently. This one bit is the "slight increase" that we referred to at the beginning of this section. ♦

Note that we do not mean to imply that the hypotheses which get special luckiness treatment are necessarily more likely to be true than any other hypothesis. Rather, luckiness codes can be interpreted as saying that this or that special subset might be important, in which case we should like to know about it!
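The numbers in this example can be reproduced directly. The sketch below is mine; the double sum is the counting formula displayed above, and evaluating it exactly with Python's big integers takes a little while.

```python
import math
from math import comb

N, S = 20, 490     # |N| nonterminals, |Sigma| terminals
R, L = 100, 2000   # bounds on rules and summed replacement length

# Luckiness code length bound (3) at the maximum size:
print((R + 1) * math.ceil(math.log2(N + 1))
      + (L + R) * math.ceil(math.log2(N + S + 1)))                       # 19405

# A small grammar (r = 10, l = 100) under the luckiness code:
print(11 * math.ceil(math.log2(21)) + 110 * math.ceil(math.log2(511)))  # 1045

# Uniform code: count the grammars within the constraints and take the
# log; the text reports a ceiling of 18992 bits.
powers = [(N + S) ** l for l in range(L + 1)]
count = sum(N ** r * sum(p * comb(l + r - 1, l)
                         for l, p in enumerate(powers))
            for r in range(1, R + 1))
print(math.ceil(math.log2(count)))
```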
2.3 Infinitely Many Hypotheses
When H is countably infinite, there can be no upper bound on the lengths of the code words used to identify the hypotheses. Since any hypothesis might turn out to be the best one in the end, the worst-case regret is often infinite in this case. In order to retain MDL as a useful inference procedure, we are forced to embrace the luckiness principle. A good way to do this is to order the hypotheses such that H1 is luckier than H2 , H2 is luckier than H3 , and so on. We then need to consider only codes L for which the code lengths increase with the index of the hypothesis. This immediately gives a lower bound on the code lengths L(Hn ) for n = 1, . . ., because the nondecreasing code that has the shortest code word for hypothesis n is uniform on the set {H1 , . . . , Hn } (and is unable to express hypotheses with higher indices). Thus L(Hn ) ≥ log |{H1 , . . . , Hn }| = log n. It is possible to define codes L with L(Hn ) = log n+O(log log n), i.e. not much larger than this ideal. Rissanen describes one such code, called the “universal code for the integers”, in [Rissanen, 1983b] (where the restriction to monotonically increasing code word lengths is not interpreted as an application of the luckiness principle as we do here); in Example 5 we describe some codes for the natural numbers that are convenient in practical applications.
2.4 Ideal MDL
In MDL hypothesis selection as described above, it is perfectly well conceivable that the data generating process has a very simple structure, which nevertheless remains undetected because it is not represented by any of the considered hypotheses. For example, we may use hypothesis selection to determine the best Markov chain order for data which reads "110010010000111111. . . ", never suspecting that this is really just the beginning of the binary expansion of the number π. In "ideal MDL" such blind spots are avoided, by interpreting any code length function LH that can be implemented in the form of a computer program as the formal representation of a hypothesis H. Fix a universal prefix Turing machine U. With some slight caveats, computer languages such as Java, C or LISP can be thought of as universal prefix mechanisms [Grünwald and Vitányi, 2008]. The result of running program T on U with input D is denoted U(T, D). The Kolmogorov complexity of a hypothesis H, denoted K(H), is the length of the shortest program TH that implements LH, i.e. U(TH, D) = LH(D) for all binary sequences D. Now, the hypothesis H can be encoded by literally listing the program TH, so that the code length of the hypotheses becomes the Kolmogorov complexity. For a thorough introduction to Kolmogorov complexity, see [Li and Vitányi, 2008]. In the literature the term "ideal MDL" is used for a number of approaches to model selection based on Kolmogorov complexity; for more information on the version described here, refer to [Barron and Cover, 1991; Grünwald and Vitányi, 2008]. The version of ideal MDL that we adopt here tells us to pick
(5)  min_{H ∈ H} ( K(H) + L_H(D) ),
which is (1), except that now H is the set of all hypotheses represented by computable length functions, and L(H) = K(H). In order for this code to be in agreement with MDL philosophy as described above, we have to check whether or not it has small regret. It is also natural to wonder whether or not it somehow applies the luckiness principle. The following property of Kolmogorov complexity is relevant for the answer to both questions. Let H be a countable set of hypotheses with computable corresponding length functions. Then for all computable length functions L on H, we have (6) ∃c > 0 : ∀H ∈ H : K(H) ≤ L(H) + c. Roughly speaking, this means that ideal MDL is ideal in two respects: first, the set of considered hypotheses is expanded to include all computable hypotheses, so that any computable concept is learned given enough data; since with any inference procedure that can be implemented as a computer algorithm — as they all can — we can only infer computable concepts, this makes ideal MDL “universal” in a very strong sense. Second, it matches all other length functions up to a constant, including all length functions with small regret as well as length functions with any clever application of the luckiness principle. Thus, we may think of the Kolmogorov
Complexity code length K(H) as implementing a sort of "universal" luckiness principle. On the other hand, performance guarantees such as (6) are not very specific, as the constant overhead may be so large that it completely dwarfs the length of the data. To avoid this, we would need to specify a particular universal Turing machine U, and give specific upper bounds on the values that c can take for important choices of H and L. While there is some work on such a concrete definition of Kolmogorov complexity for individual objects [Tromp, 2007], there are as yet no concrete performance guarantees for ideal MDL or other forms of algorithmic inference. The more fundamental reason why ideal MDL is not practical is that Kolmogorov complexity is uncomputable. Thus it should be understood as a theoretical ideal that can serve as an inspiration for the development of methods that can be applied in practice.
3 ENCODING THE DATA
On page 866 we asked how the codes that formally represent the hypotheses should be constructed. Often many different interpretations are possible and it is a matter of judgement how exactly a hypothesis should be made precise. There is one important special case however, where the hypothesis is formulated in the form of a set of probability distributions. Statisticians call such a set a model. Possible models include the set of all normal distributions with any mean and variance, or the set of all third order Markov chains, and so on. Model selection is a prime application of MDL, but before we can discuss how the codes to represent models are chosen, we have to discuss the close relationship between coding and probability theory.
3.1 Codes and Probability Distributions
We have introduced the MDL principle in terms of coding; here we will make precise what we actually mean by a code and what properties we require our codes to have. We also make the connection to statistics by describing the correspondence between code length functions and probability distributions. A code C : X → {0, 1}∗ is an injective mapping from a countable source alphabet X to finite binary sequences called code words. We consider only prefix codes, that is, codes with the property that no code word is the prefix of another code word. This restriction ensures that the code is uniquely decodable, i.e. any concatenation of code words can be decoded into only one concatenation of source symbols. Furthermore, a prefix code has the practical advantage that no lookahead is required for decoding, that is, given any concatenation S of code words, the code word boundaries in any prefix of S are determined by that prefix and do not depend on the remainder of S. Prefix codes are as efficient as other uniquely decodable codes; that is, for any uniquely decodable code with length function
L_C there is a prefix code C′ with L_{C′}(x) ≤ L_C(x) for all x ∈ X; see [Cover and Thomas, 1991, Chapter 5]. Since we never consider non-prefix codes, from now on, whenever we say "code", this should be taken to mean "prefix code".

Associated with a code C is a length function L : X → N, which maps each source symbol x ∈ X to the length of its code word C(x). Of course we want to use efficient codes, but there is a limit to how short code words can be made. For example, there is only one binary sequence of length zero, two binary sequences of length one, four of length two, and so on. The precise limit is expressed by the Kraft inequality:

LEMMA 4 Kraft inequality. Let X be a countable source alphabet. A function L : X → N is the length function of a prefix code on X if and only if:

Σ_{x ∈ X} 2^{−L(x)} ≤ 1.
Proof. See for instance [Cover and Thomas, 1991, page 82].
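A quick numerical check of the lemma (my illustration; the length profiles are arbitrary examples):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """The Kraft sum of a proposed length function: sum of 2^(-L(x))."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

# The prefix code {0, 10, 110, 111} has lengths 1, 2, 3, 3; its Kraft
# sum is exactly 1, so the code is complete.
print(kraft_sum([1, 2, 3, 3]))    # 1
# Lengths 1, 1, 2 sum to 5/4 > 1: no prefix code can achieve them.
print(kraft_sum([1, 1, 2]) <= 1)  # False
```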
If, for a given code C with length function L, the inequality holds strictly, then the code is called defective, otherwise it is called complete. (The term "defective" is usually reserved for probability distributions, but we apply it to code length functions as well.) Let C be any prefix code on X with length function L, and define

(7)  ∀x ∈ X : P(x) := 2^{−L(x)}.
Since P(x) is always positive and sums to at most one, it can be interpreted as a probability mass function that defines a distribution corresponding to C. This mass function and distribution are called complete or defective if and only if C is. Vice versa, given a distribution P, according to the Kraft inequality there must be a prefix code L satisfying

(8)  ∀x ∈ X : L(x) := ⌈− log P(x)⌉.
To further clarify the relationship between P and its corresponding L, define the entropy of distribution P on a countable outcome space X by

(9)  H(P) := − Σ_{x ∈ X} P(x) log P(x).
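A small sketch (mine; the distribution is an arbitrary dyadic example) computing (8) and (9) side by side. As discussed next, the expected code length stays within one bit of the entropy, and for a dyadic distribution it matches it exactly:

```python
import math

def shannon_lengths(p):
    """Code lengths (8): L(x) = ceil(-log2 P(x))."""
    return {x: math.ceil(-math.log2(q)) for x, q in p.items()}

def entropy(p):
    """Entropy (9) in bits."""
    return -sum(q * math.log2(q) for q in p.values())

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
L = shannon_lengths(P)                          # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(entropy(P), sum(P[x] * L[x] for x in P))  # 1.75 1.75
```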
(This H should not be confused with the H used for hypotheses.) According to Shannon’s noiseless coding theorem [Shannon, 1948], the mean number of bits used to encode outcomes from P using the most efficient code is at least equal to the entropy, i.e. for all length functions L′ of prefix codes, we have EP [L′ (X)] ≥ H(P ). The expected code length using the L from (8) stays within one bit of entropy. This one bit is a consequence of the requirement that code lengths have to be integers. Note that apart from rounding, (7) and (8) describe a one-to-one correspondence
between probability distributions and code length functions that are most efficient for those distributions. Technically, probability distributions are usually more convenient to work with than code length functions, because their usage does not involve rounding. But conceptually, code length functions are often more intuitive objects than probability distributions. A practical reason for this is that the probability of the observations typically decreases exponentially as the number of observations increases, and such small numbers are hard to handle psychologically, or even to plot in a graph. Code lengths typically grow linearly with the number of observations, and have an analogy in the real world, namely the effort required to remember all the obtained information, or the money spent on storage equipment. A second disadvantage of probability theory is the philosophical controversy regarding the interpretation of probability, which is discussed in Section 5.3.

In order to get the best of both worlds, namely the technical elegance of probability theory combined with the conceptual clarity of coding theory, we generalise the concept of coding such that code lengths are no longer necessarily integers. While the length functions associated with such "ideal codes" are really just alternative representations of probability mass functions (or even densities), and probability theory is used under the hood, we will nonetheless call negative logarithms of probabilities "code lengths" to aid our intuition and avoid confusion. Since this difference between an ideal code length and the length using a real code is at most one bit, this generalisation should not require too large a stretch of the imagination. In applications where it is important to actually encode the data, rather than just compute its code length, there is a practical technique called arithmetic coding [Rissanen, 1976] which can usually be applied to achieve the ideal code length to within a few bits; for the purposes of MDL inference it is sufficient to compute code lengths, and we do not have to construct actual codes.

EXAMPLE 5. In Section 2.3 we remarked that a good code for the natural numbers always achieves code length close to log n. Consider a probability distribution W defined as W(n) = f(n) − f(n + 1) for some function f : N → R. If f is decreasing with f(1) = 1 and f → 0, then W is an easy to use probability mass function that can be used as a basis for coding the natural numbers. To see this, note that Σ_{i=1}^{n} W(i) = 1 − f(n + 1) < 1 and that therefore Σ_{i=1}^{∞} W(i) = 1. For f(n) = 1/n we get code lengths − log W(n) = log(n(n + 1)) ≈ 2 log n, which, depending on the application, may be small enough. Even more efficient for high n are f(n) = n^{−α} for some 0 < α < 1 or f(n) = 1/ log(n + 1). ♦
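The code lengths in Example 5 are one-liners to compute. The sketch below is mine; per the convention above, log is the binary logarithm, so f(n) = 1/log(n + 1) satisfies f(1) = 1.

```python
import math

def int_code_len(n, f):
    """Ideal code length -log2 W(n), with W(n) = f(n) - f(n + 1)."""
    return -math.log2(f(n) - f(n + 1))

def inverse(m):
    return 1 / m                  # f(n) = 1/n

def inv_log(m):
    return 1 / math.log2(m + 1)   # f(n) = 1/log(n + 1)

for n in (10, 1000, 10 ** 6):
    print(n, round(int_code_len(n, inverse), 1),  # roughly 2 log2 n
             round(int_code_len(n, inv_log), 1))  # shorter for large n
```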
3.2 Universal Coding and Model Selection
We have discussed earlier which code L should be used to identify the hypotheses. Here we address the question of which code L_H should represent a hypothesis H ∈ H. So far we have treated the data as a monolithic block of information, but it is often convenient to think of the data as a sequence of observations D = x_1, x_2, ..., x_n instead. For instance, this makes it easier to discuss sequential prediction later on.
For simplicity we will assume each observation x_i is from some countable outcome space X; we also abbreviate x^n = x_1, ..., x_n. If the hypothesis H is given in terms of a probability distribution P from the outset, H = P, then we identify it with the code satisfying (8), but without the integer requirement, so that L_H(x^n) = −log P(x^n). There are various reasons why this is the only reasonable mapping [Grünwald, 2007]. In that case no more work needs to be done, and we can proceed straight to MDL hypothesis selection as described above. In the remainder of this section, we consider instead the case where the hypothesis is given in the form of a model M = {P_θ | θ ∈ Θ}: a set of probability distributions parameterised by a vector θ from some set Θ of allowed parameter vectors, called the parameter space. For example, a model could be the set of all fourth-order Markov chains, or the set of all normal distributions. In this case it is not immediately clear which single code should represent the hypothesis. To motivate the code that represents such a model, we apply the same reasoning as we used in Section 2: the code should guarantee a small regret, but is allowed to favour some small subsets of the model on the basis of the luckiness principle. To make this idea more precise, we define a maximum likelihood estimator as a function θ̂ : X* → Θ satisfying

(10)   P_{\hat\theta(x^n)}(x^n) = \max\{P_\theta(x^n) \mid \theta \in \Theta\}.
Obviously, θ̂(x^n) also minimises the code length. We abbreviate θ̂ = θ̂(x^n) if the sequence of outcomes x^n is clear from context. We start by associating with each individual P_θ ∈ M the corresponding code L_θ with L_θ(x^n) = −log P_θ(x^n), and henceforth write and think of M as a set of codes {L_θ | θ ∈ Θ}.
DEFINITION 6. Let M := {L_θ | θ ∈ Θ} be a (countable or uncountably infinite) model with parameter space Θ. Let f : Θ × N → [0, ∞) be some function. A code L is called f-universal for the model M if, for all n ∈ N and all x^n ∈ X^n, we have R(L, M, x^n) ≤ f(θ̂, n).
Our notion of universality is called individual-sequence universality in the information-theoretic literature [Grünwald, 2007]. When information theorists talk of universal codes, they usually refer to another definition, in terms of expectation rather than individual sequences [Cover and Thomas, 1991]. Our formulation is very general, with a function f that needs further specification. This generality is needed because, as it turns out, different degrees of universality are possible in different circumstances; the definition allows us to express easily what we expect a code to live up to in all these cases. For finite M, a uniform code similar to the one described in Example 2 achieves f-universality for f(θ̂, n) = log |M|. Of course we incur a small overhead on top of this if we decide to use luckiness codes. For countably infinite M, the regret cannot be bounded by a single constant, but we can avoid dependence on the sample size. Namely, if we define a mass function
W on the parameter set, we can achieve f-universality for f(θ̂, n) = −log W(θ̂) by using a Bayesian or two-part code (these are explained in subsequent sections). Finally, for uncountably infinite M it is often impossible to obtain a regret bound that does not depend on n. For parametric models, however, it is often possible to achieve f-universality for

(11)   f(\hat\theta, n) = \frac{k}{2} \log \frac{n}{2\pi} + g(\hat\theta),

where k is the number of parameters of the model and g : Θ → [0, ∞) is some continuous function of the maximum likelihood parameter. Examples include the Poisson model, which can be parameterised by the mean of the distribution, so k = 1, and the normal model, which can be parameterised by mean and variance, so k = 2. Thus, for parametric uncountable models a logarithmic dependence on the sample size is the norm.

In MDL model selection, universal codes are used on two levels. On the inner level, each model H is represented by a universal code for that model. Then on the outer level, a code with length function L(H) + L_H(D) is always used, see (1). This code is also universal, this time with respect to {L_H | H ∈ H}, which is typically countable. This outer-level code is called a two-part code, explained in more detail below. Thus, MDL model selection proceeds by selecting the H (a model, i.e. a family of distributions) that minimises the two-part code length L(H) + L_H(D), where L_H(D) is itself a universal code length relative to model H.

We now describe the four most common ways to construct universal codes, and we illustrate each by applying it to the model of Bernoulli distributions M = {P_θ | θ ∈ [0, 1]}. This is the "biased coin" model, which contains all possible distributions on 0 and 1. The distributions are parameterised by the probability of observing a one: P_θ(X = 1) = θ. Thus, θ = 1/2 represents the distribution of an unbiased coin. The distributions in the model are extended to n outcomes by taking the n-fold product distribution: P_θ(x^n) = P_θ(x_1) · · · P_θ(x_n).
3.2.1 Two-Part Codes
In Section 2 we defined a code L_mdl that we now understand to be universal for the model {L_H | H ∈ H}. This kind of universal code is called a two-part code, because it consists first of a specification of an element of the model, and second of a specification of the data using that element. Two-part codes may be defective: this occurs if multiple code words represent the same data sequence x^n. In that case one must ensure that the encoding function is well-defined by specifying exactly which representation is associated with each data sequence. Since we are only concerned with code lengths, however, it suffices to adopt the convention that we always use one of the shortest representations. We mentioned that, in model selection problems, universal coding is applied on two levels. At the outer level, two-part coding is always used, because it allows us to associate a single hypothesis with the achieved code length L_mdl(x^n). In model selection problems, two-part coding can also be used at the inner level. An
example of a two-part code that can be used as a universal code on the inner level follows below. We will see later that on this inner level, other universal codes are sometimes preferable, because they are more efficient and do not require discretisation.

EXAMPLE 7. We define a two-part code for the Bernoulli model. In the first part of the code we specify a parameter value, which requires some discretisation, since the parameter space is uncountable. However, as the maximum likelihood parameter for the Bernoulli model is just the observed frequency of heads, at a sample size of n we know that the ML parameter is one of 0/n, 1/n, ..., n/n. We discretise by restricting the parameter space to this set. A uniform code uses L(θ) = log(n+1) bits to identify an element of this set; therefore the resulting regret is always exactly log(n+1). By using a slightly more clever discretisation we can bring this regret down to about (1/2) log n + O(1), which, as we mentioned, is usually achievable for uncountable single-parameter models. ♦
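As a concrete illustration, the following sketch (ours, not part of the chapter) computes this two-part code length with the uniform discretisation 0/n, ..., n/n and checks that the regret over the maximum likelihood code length is exactly log₂(n+1) bits.

```python
import math

def data_bits(theta, n0, n1):
    """-log2 P_theta(x^n) for a sequence with n0 zeroes and n1 ones."""
    if (theta == 0 and n1 > 0) or (theta == 1 and n0 > 0):
        return math.inf
    bits = 0.0
    if n1 > 0: bits -= n1 * math.log2(theta)
    if n0 > 0: bits -= n0 * math.log2(1 - theta)
    return bits

def two_part_bits(n0, n1):
    """log2(n+1) bits for the discretised parameter k/n (uniform code),
    plus the code length of the data under the best grid parameter."""
    n = n0 + n1
    return math.log2(n + 1) + min(data_bits(k / n, n0, n1)
                                  for k in range(n + 1))

n0, n1 = 70, 30
ml_bits = data_bits(n1 / (n0 + n1), n0, n1)   # best code in the model
print(two_part_bits(n0, n1) - ml_bits)        # regret of the two-part code
print(math.log2(n0 + n1 + 1))                 # equals log2(n+1), as claimed
```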
3.2.2 The Bayesian Universal Distribution
Let M = {P_θ | θ ∈ Θ} be a countable model; it is convenient to use mass functions rather than codes as the elements of the model here. Now define a distribution with mass function W on the parameter space Θ. This distribution is called the prior distribution in the literature, as it is often interpreted as a representation of a priori beliefs as to which of the hypotheses in M represents the "true state of the world". More in line with the philosophy outlined above would be the interpretation that W is a code which should be chosen, for practical reasons, to optimise inference performance. At any rate, the next step is to define a joint distribution P on X^n × Θ by P(x^n, θ) = P_θ(x^n) W(θ). In this joint space, each outcome comprises a particular state of the world and an observed outcome. In Bayesian statistics, inference is always based on this joint distribution. For our purpose, the one relevant quantity is the marginal likelihood of the data:

(12)   P(x^n) = \sum_{\theta \in \Theta} P_\theta(x^n) W(\theta).
Note that the marginal likelihood is a weighted average, satisfying

(13)   P_{\hat\theta}(x^n) W(\hat\theta) \le P(x^n) \le P_{\hat\theta}(x^n),

where the first inequality is obtained by underestimating (12) by a single term of the sum, and the second by overestimating it by the largest item the weighted average is taken over. Together these bounds express that the Bayesian marginal likelihood is at most a factor W(θ̂) worse than the best element of the model, which is not very much, considering that P_{θ̂}(x^n) is often exponentially small in n. Note that this Bayesian universal code yields a code length less than or equal to the code length we would have obtained with the two-part code with L(θ) = −log W(θ). Since we already found that two-part codes are universal, we can now conclude that Bayesian codes are at least as universal, with regret at most
−log W(θ̂). On the flip side, the sum involved in calculating the Bayesian marginal distribution can be hard to evaluate in practice.

EXAMPLE 7 continued. Our definitions readily generalise to uncountable models with Θ ⊆ R^k, with the prior distribution given by a density w on Θ. Rather than giving explicit definitions, we revisit our running example and construct a Bayesian universal code for the Bernoulli model. For simplicity we use a uniform prior density, w(θ) = 1. Let n_0 and n_1 = n − n_0 denote the number of zeroes and ones in x^n, respectively. Now we can calculate the Bayes marginal likelihood of the data:

P(x^n) = \int_0^1 P_\theta(x^n) \cdot 1 \, d\theta = \int_0^1 \theta^{n_1} (1-\theta)^{n_0} \, d\theta = \frac{n_0! \, n_1!}{(n+1)!}.

Using Stirling's approximation of the factorial, we find that the corresponding code length −log P(x^n) equals −log P_{θ̂}(x^n) + (1/2) log n + O(1). Thus we find a regret similar to that achieved by a well-designed two-part code, but the constant will be slightly better. ♦
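The Bayesian code length for this example is easy to compute from the closed form just derived; the following sketch (ours) checks numerically that the regret grows like (1/2) log₂ n.

```python
import math

def bayes_bits(n0, n1):
    """-log2 of the Bayes marginal likelihood n0! n1! / (n+1)!."""
    return -(math.lgamma(n0 + 1) + math.lgamma(n1 + 1)
             - math.lgamma(n0 + n1 + 2)) / math.log(2)

def ml_bits(n0, n1):
    """-log2 P_theta-hat(x^n), with theta-hat = n1 / n."""
    n, bits = n0 + n1, 0.0
    if n0 > 0: bits -= n0 * math.log2(n0 / n)
    if n1 > 0: bits -= n1 * math.log2(n1 / n)
    return bits

for n in (100, 10000, 1000000):
    n1 = n // 4
    regret = bayes_bits(n - n1, n1) - ml_bits(n - n1, n1)
    print(n, round(regret, 2), round(0.5 * math.log2(n), 2))
```

The difference between the two printed numbers stays nearly constant as n grows, as the (1/2) log n + O(1) analysis predicts.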
3.2.3 Normalised Maximum Likelihood
The Normalised Maximum Likelihood universal code is preferred in the MDL literature, because it provably minimises the worst-case regret. It is not hard to show that the code minimising the worst-case regret must achieve equal regret for all possible outcomes x^n. In other words, the total code length must always be some constant longer than the code length achieved on the basis of the maximum likelihood estimator [Grünwald, 2007, Chapter 6]. The Normalised Maximum Likelihood (NML) distribution is defined to achieve exactly this:

(14)   P_{\mathrm{nml}}(x^n) := \frac{P_{\hat\theta(x^n)}(x^n)}{\sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n)}.
For all outcomes x^n, the regret on the basis of this distribution is exactly equal to the logarithm of the denominator, called the parametric complexity of the model:

(15)   \inf_L \sup_{x^n} R(L, M, x^n) = \log \sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n).
Under some regularity conditions on the model it can be shown [Rissanen, 1996; Grünwald, 2007] that, if we restrict the parameters to an ineccsi set Θ_0 ⊂ Θ, i.e. a subset of Θ that is closed, bounded, has nonempty interior and excludes the boundary of Θ, then the parametric complexity is finite and given by

(16)   \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta_0} \sqrt{\det I(\theta)} \, d\theta + o(1),
in accordance with (11) for a constant function g(θ̂). Here det I(θ) is the determinant of the Fisher information matrix, a fundamental quantity in statistics (see
[Grünwald, 2007, Chapter 6] for a precise statement). This is an important result: for example, it implies that we can never hope to construct universal codes with regret smaller than (16) on any substantial subset Θ_0 of Θ. Thus, if the parametric complexity is large, then with any coding method whatsoever we need a lot of bits to code the data, on top of those needed by the best-fitting element of the model. The model is thus "complex", independently of the particular description method we use. We may also look at this in a different way: we see from the sum in (15) that a model is complex iff, for many data sequences y^n, it contains a distribution that fits these sequences well in the sense that P_{θ̂(y^n)}(y^n) is large. Thus, the parametric complexity of a model is related not so much to how many distributions it contains, but to how many patterns can be fit well by a distribution in it. It can be shown [Grünwald, 2007, Chapter 6] that this is related to the number of "essentially different" distributions the model contains. We note that neither the parametric complexity (15) nor its approximation (16) depends on the particular chosen parameterisation, i.e. the function which maps θ to P_θ. While the parameterisation is arbitrary, the parametric complexity and its approximation are inherent properties of the model M under consideration.
EXAMPLE 7 continued. The parametric complexity (15) has exponentially many terms, but for the Bernoulli model the expression can be significantly simplified: we can group together all terms which have the same maximum likelihood estimator. Thus the minimal worst-case regret can be rewritten as follows:

(17)   \log \sum_{y^n \in \mathcal{X}^n} P_{\hat\theta(y^n)}(y^n) = \log \sum_{n_1=0}^{n} \binom{n}{n_1} \left(\frac{n_1}{n}\right)^{n_1} \left(\frac{n-n_1}{n}\right)^{n-n_1}.
This sum has only linearly many terms and can usually be evaluated in practice. Approximation by Stirling's formula confirms that the asymptotic regret is (1/2) log n + O(1), the same as for the other universal distributions. ♦

The NML distribution has a number of significant practical problems. First, it is often undefined, because for many models the denominator in (14) is infinite, even for such simple models as the Poisson or geometric distributions. Second, X^n is exponentially large in n, so calculating the NML probability exactly is only possible in special cases such as the Bernoulli model above, where the number of terms is reduced using some trick. Something similar is possible for the more general multinomial model, see [Kontkanen and Myllymäki, 2007], but in most cases (15) has to be approximated, which introduces errors that are hard to quantify.
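For the Bernoulli model, at least, the sum (17) is indeed easy to evaluate; the following sketch (ours) does so and confirms the (1/2) log₂ n growth of the parametric complexity.

```python
import math

def bernoulli_complexity_bits(n):
    """Parametric complexity (log2 of the NML denominator) of the Bernoulli
    model, computed with the linear-size sum (17)."""
    total = 0.0
    for n1 in range(n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                    - math.lgamma(n - n1 + 1))        # log of binomial(n, n1)
        if 0 < n1 < n:                                # 0 log 0 = 0 at the ends
            log_term += (n1 * math.log(n1 / n)
                         + (n - n1) * math.log(1 - n1 / n))
        total += math.exp(log_term)
    return math.log2(total)

for n in (10, 100, 1000, 10000):
    print(n, round(bernoulli_complexity_bits(n), 2),
          round(0.5 * math.log2(n), 2))
```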
3.2.4 Prequential Universal Distributions and Predictive MDL
A distribution P (or its corresponding code), defined for a sequence of outcomes x^n := x_1, x_2, ..., x_n, can be reinterpreted as a sequential (probabilistic) prediction strategy using the chain rule:

(18)   P(x^n) = P(x^1) \cdot \frac{P(x^2)}{P(x^1)} \cdots \frac{P(x^n)}{P(x^{n-1})} = P(x_1) \cdot P(x_2 \mid x^1) \cdots P(x_n \mid x^{n-1}).
The individual factors are the actual probabilities P(x_1), P(x_2 | x^1), ... assigned to x_1, x_2, ... by the conditional "predictive" probability distributions P(X_1 = ·), P(X_2 = · | x^1), ... Thus, to any distribution P for the data there corresponds a sequential prediction strategy: we predict x_1 by marginalising P before having seen any data, then we look at x_1, condition P accordingly and marginalise again to obtain a prediction of x_2, and so on. The chain rule shows that the probability of x^n must be equal to the product of the individual predictive probabilities. Vice versa, any prediction strategy that issues a distribution on the next outcome x_{i+1} given any sequence of previous outcomes x^i defines a distribution (or code) on the whole data via the chain rule. Together with the correspondence between code length functions and probability distributions described above, this links the domains of coding and sequential prediction.

An algorithm that, given a sequence of previous observations x^n, issues a probability distribution P(X_{n+1} | x^n) on the next outcome may thus be viewed as a sequential (probabilistic) prediction strategy. In Dawid's theory of prequential analysis such algorithms are called prequential forecasting systems [Dawid, 1984; Dawid, 1992b; Dawid, 1992a]. A prediction strategy defines a distribution on x^n by the chain rule (18); vice versa, for any distribution P on X^n, application of the chain rule yields a prediction strategy.

We give two important prediction strategies here. First, we may take a Bayesian approach and define a joint distribution on X^n × Θ based on some prior distribution W. As before, this induces a marginal distribution on X^n, which in turn defines a prediction strategy. Thus, the Bayesian universal distribution can be reinterpreted as a prediction strategy: it is in fact an important special case of a prequential universal distribution. Second, since the Bayesian predictive distribution can be hard to compute, it may be useful in practice to define prediction strategies that use simpler algorithms. Perhaps the simplest option is to predict an outcome X_{n+1} using a maximum likelihood estimator (10) evaluated on the previous outcomes. This "prequential maximum likelihood plug-in" approach was suggested independently by Dawid and by Rissanen, who calls it predictive MDL [Rissanen, 1984; Rissanen, 1986a]; we will use it in our running example.

EXAMPLE 7 continued. The ML estimator for the Bernoulli model parameterised by P_θ(X = 1) = θ equals the frequency of ones in the sample: θ̂(x^n) = (1/n) ∑_{i=1}^{n} x_i. We define a prediction strategy through P(X_{n+1} = 1 | x^n) := P_{θ̂(x^n)}(X_{n+1} = 1) = θ̂(x^n). This strategy is ill-defined for the first outcome. Another impractical feature is that it assigns probability 0 to the event that the second outcome is different from the first. To address these problems, we slightly tweak the estimator: rather than θ̂ we use θ̃ = (1 + ∑_{i=1}^{n} x_i)/(n + 2). Perhaps surprisingly, in this case the resulting prediction strategy is equivalent to the Bayesian universal distribution approach we defined in the previous section: P_{θ̃} turns out to be the Bayesian predictive distribution for the Bernoulli model if a uniform prior density w(θ) = 1 is used. This is somewhat coincidental: for non-Bernoulli models, the distribution indexed by such a "tweaked" ML estimator is usually quite different from the Bayesian predictive distribution [Grünwald, 2007, Chapter 9]. ♦
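The following sketch (ours) implements the tweaked plug-in strategy and verifies the claimed equivalence numerically: by the chain rule (18), the accumulated code length of its sequential predictions equals the Bayes marginal code length −log₂(n_0! n_1!/(n+1)!) computed earlier.

```python
import math

def plug_in_bits(xs):
    """Accumulated log loss (= code length) of the strategy that predicts the
    next outcome with the tweaked estimator (1 + #ones so far) / (i + 2)."""
    bits, ones = 0.0, 0
    for i, x in enumerate(xs):
        p_one = (1 + ones) / (i + 2)
        bits -= math.log2(p_one if x == 1 else 1 - p_one)
        ones += x
    return bits

xs = [1, 0, 0, 1, 1, 0, 1, 1]
n0, n1 = xs.count(0), xs.count(1)
bayes_bits = -(math.lgamma(n0 + 1) + math.lgamma(n1 + 1)
               - math.lgamma(len(xs) + 2)) / math.log(2)
print(plug_in_bits(xs), bayes_bits)   # the two code lengths coincide
```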
The ML plug-in code is useful because it is often easier to implement than the other universal codes: it does not require discretisation, like the two-part code, nor integration over the parameter space, like the Bayesian code, nor summation over all possible data sequences, like NML. Instead it only requires calculation of the ML estimator. While the prequential ML code has been used successfully in practical inference problems, it is shown in [de Rooij and Grünwald, 2006] that (in expectation) it does not necessarily achieve the desired regret (11) of (k/2) log n + O(1) unless the data are actually sampled from a distribution in the model. Application of the ML plug-in code thus requires exactly the kind of assumption about the "truth" that we have been trying to avoid. Therefore the ML plug-in code should be used with care, especially in applications of model selection. Another feature of prequential universal codes is that in general they are sensitive to the ordering of the data, which may be undesirable in case the data have no natural ordering. Interestingly, this does not apply to the Bayesian sequential prediction strategy, which is order-independent in i.i.d. settings. That is: if we change the order of the data x^n, then the individual predictions of x_i given x^{i-1} change, but the product (18) of these probabilistic predictions remains the same.

Unordered Data. When prequential universal codes are defined for models that assume the data to be unordered (the most important case being i.i.d. data), the result often assigns different code lengths to different permutations of the data sequence, even if the codes that make up the models do not. Fortunately, the Bayesian prequential code does not suffer from this effect; neither does it occur for the two-part and NML universal codes, which are not prequential. However, the ML plug-in code length generally does depend on the order of the data. Also, several methods to turn NML codes into prediction strategies are studied in the literature [Rissanen and Roos, 2007]; the resulting prequential codes typically do depend on the order of the data. Similarly, recent attention to model switching (Section 5.5) has resulted in codes that introduce order dependence.

If the data are unordered, should we not encode the data as a set rather than as a sequence; in other words, should we not try to avoid encoding the (arbitrary) order altogether? We do not favour such an approach, because once the order of the data is disregarded, we can no longer compare the code lengths we obtain to those we would get using other hypotheses that do assume order dependence! It would prevent us from testing our assumption that the data are unordered. A second, more practical argument against disregarding the order is that it is often easier to encode sequences than it is to encode sets. As long as all permutations of the sequence are assigned equal code lengths, no bias is introduced when comparing code lengths for sequences rather than sets.

However, as we mentioned, many prequential codes do depend on the order. This means that we have to make sure that these codes still achieve low regret in
the worst case, i.e. that their performance does not deteriorate too much when the data appear in a particularly unlucky order. Unfortunately, for many prequential codes such worst-case results have not been proven, and we have to make do with performance guarantees that hold either in expectation or with high probability. While this issue does not appear to cause too much trouble in practice, it is slightly at odds with our interpretation of MDL philosophy.
4 THE PURPOSE OF MDL
While we have equated "learning" with "compressing the data" so far, in reality we often have a more specific goal in mind when we apply learning algorithms. This touches on the third question we asked on page 866: what do we mean by a "useful" hypothesis? In this section we will argue that MDL inference, while designed to achieve optimal compression of the data, is also suitable in some settings with a different objective, so that, at least to some extent, MDL provides a reliable one-size-fits-all solution.

EXAMPLE 1 continued. We return to grammar learning. The code we used for the data given the grammar is not very sophisticated. A better approach is the following: to encode each sentence, we start with the starting symbol, and then we repeatedly specify which production rule is to be applied to the first remaining nonterminal, until none are left. We still have to choose a code/distribution on the rules that match each nonterminal, to be able to specify which matching rule is applied. Of course, in practice some production rules will be used much more often than others, and by specifying the "right" probabilities (approximately matching empirical frequencies) for each rule, we may achieve substantial compression of the data. Such a grammar, where each nonterminal has a distribution on the matching production rules, is called a stochastic context-free grammar; a sketch of this encoding scheme appears after this example. We generally do not know how we should set the probabilities to obtain optimal code lengths. However, we now have the tools to minimise the worst-case regret. Namely, each grammar defines a model on sentences: a set of distributions that is parameterised by the probabilities of applying production rules to matching nonterminals. For each model, we can subsequently define one of the universal codes of Section 3.2. This results in a code for each grammar that performs almost as well as the stochastic context-free grammar with optimally tuned probabilities of the production rules. Each of the applications of MDL we describe below can be usefully applied to stochastic context-free grammars. (1) We may use the two-part code we defined for grammar and data to compress the data (Section 4.1). (2) We may use the grammar to predict the continuation of the data (Section 4.3); language models are used for prediction in applications such as speech recognition. (3) We may infer a grammar that matches the data well, without attempting to learn the model parameters as well; we then obtain a model selection problem (Section 4.2.2). Or alternatively, (4) we may attempt to learn not only the grammar, but the
parameters as well. This is the estimation problem of Section 4.2.1. Note that in these last two cases, theoretical consistency results only carry over to real data if those data are really sampled from a stochastic context-free grammar, which may be unrealistic. ♦
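To make the encoding scheme concrete, here is a toy sketch (ours; the grammar, its rules and all probabilities are invented for illustration). It computes the ideal code length of one sentence as the sum of −log₂ probabilities of the production rules applied to the leftmost remaining nonterminal.

```python
import math

# Toy stochastic context-free grammar: each nonterminal has a distribution
# over its production rules. All probabilities here are made up.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("N",), 0.3)],
    "VP": [(("V", "NP"), 0.4), (("V",), 0.6)],
    "N":  [(("cat",), 0.5), (("dog",), 0.5)],
    "V":  [(("sleeps",), 0.8), (("chases",), 0.2)],
}

def derivation_bits(choices):
    """Ideal code length of a derivation, given as (nonterminal, rule index)
    choices for the leftmost nonterminal at each step."""
    bits = 0.0
    for nt, idx in choices:
        _, prob = RULES[nt][idx]
        bits -= math.log2(prob)   # cost of identifying the applied rule
    return bits

# "the cat sleeps": S -> NP VP; NP -> the N; N -> cat; VP -> V; V -> sleeps
print(derivation_bits([("S", 0), ("NP", 0), ("N", 0), ("VP", 1), ("V", 0)]))
```

Note that the rule for S has probability one and therefore costs 0 bits; frequent rules are cheap and rare rules expensive, exactly as the empirical-frequency argument above suggests.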
4.1 Data Compression
Until now we have considered MDL for compression, which is an important practical goal in itself. A substantial amount of progress in the data compression literature builds on MDL research. For example, the elegant Context Tree Weighting algorithm [Willems et al., 1997] expands on earlier MDL-related work by Rissanen [Rissanen, 1983a]. The following subsections discuss in detail how compression is related to learning from data, first in the form of "truth finding", which includes estimation and model selection, and then in the form of prediction. The MDL principle is more general, though: at least in principle, it can be applied to all problems involving inductive inference, not just compression, prediction and truth finding. For example, it is eminently suited for denoising and separating structure from noise, and it has also been used for similarity analysis, clustering, lossy data compression and DNA sequence alignment. We will not go into these applications here; for a non-exhaustive list with references, see [Grünwald, 2007, Chapter 1].
4.2 Truth Finding
Let H be a set of probability distributions, and let x^n be the observed data. The MDL principle provides a clear interpretation of learning without the assumption that the data are sampled from one of the distributions in H, or indeed without any probabilistic assumptions at all. It can be applied in situations in which some P ∈ H is "adequate" (in the sense that it leads to valid insights) or "useful" (in the sense that it leads to reasonable predictions of future data; such a P may be quite far off from the underlying "truth"). Still, as a sanity check, we want to make sure that MDL methods provably perform well in the special case in which some P* ∈ H is true, in the sense that the data are sampled from P*. Below we verify that MDL methods indeed behave reasonably in such ideal circumstances. First, in Section 4.2.1, we consider the case where the goal is simply to infer a good approximation of P*. This includes the traditional statistical problems of parametric and nonparametric density estimation. In that situation, MDL performs well in the sense that, with high probability, the MDL methods output a good approximation to the true P* even at modest sample sizes. In Section 4.2.2 we consider model selection, where H is carved up into finitely or countably many parametric models M_1, M_2, ..., and the goal is to identify the (smallest) M_{δ*} containing the true P*. In that situation, MDL performs well in the sense that, with high probability, when given a large enough sample, the MDL method outputs M_{δ*}. Finally, in Section 4.3, we very briefly consider a version of truth finding more closely related to prediction.
4.2.1 Estimation
Here we assume that the true distribution P* lies in the hypothesis space H, and investigate whether we can approximate P* using MDL. The hypothesis space H may be a parametric model (e.g., the set of Bernoulli distributions, or the set of second-order Markov chains); in that case our goal reduces to a most familiar topic of statistics: parameter estimation. Alternatively, H may be so large that it cannot be indexed by a finite set of real-valued parameters. An example is the set of all distributions for 0 < X < 1 that have differentiable densities. In that case, our goal is what statisticians call "nonparametric density estimation". Yet another possibility is that H = ∪_{k=1,2,...} M_k, where each M_k is a k-dimensional parametric model. In this last scenario we assume the truth to be "parametric", but the dimension is unknown. The difference with model selection proper, which is discussed further below, is that in estimation we are interested in inferring a single element P* residing in some of the M_k, whereas in model selection we only try to find out which M_k contains P*.

For all such estimation problems we use a two-part code: we first encode an element P of H using a code length function denoted L, and then we encode the data using the code that corresponds to P. In total we need L(P) − log P(x^n) bits to encode x^n via P. We then select the P̈ ∈ H that minimises this total. This P̈ may be viewed as an estimator in the statistical sense: it maps data sequences of arbitrary length to distributions in H. The performance of statistical estimators is often expressed in terms of their statistical risk, the expected distance between P̈ and the true P*. For technical reasons it is convenient to measure the risk in terms of the squared Hellinger distance D_He, a distance measure between probability distributions (see [Grünwald, 2007] for a definition). With respect to this distance, the risk of the two-part MDL estimator relative to the true distribution P* at sample size n is defined as

(19)   R_n(P^*, \ddot{P}) = \mathbb{E}_{X^n \sim P^*}\left[ D_{\mathrm{He}}(P^*, \ddot{P}) \right].
Note that P̈ is really a function of the X^n over which we take the expectation. As more and more data are gathered, the two-part MDL estimator tends to get better and better, in the sense that the risk tends to get smaller and smaller. The rate of convergence expresses how quickly the risk converges to 0. We have the following:

THEOREM 8. Suppose that, for all P* ∈ H, data X_1, X_2, ... are i.i.d. under P*. Then, under a mild regularity condition on the code length function L, we have, for all P* ∈ H,

(20)   R_n(P^*, \ddot{P}) \le \frac{\mathbb{E}_{X^n \sim P^*}\left[ L(\ddot{P}) - \log \ddot{P}(X^n) - [-\log P^*(X^n)] \right]}{n}.
In words: the statistical risk of two-part code MDL relative to P ∗ is bounded by the expected overhead of the two-part code relative to P ∗ , divided by n. This provides
a deep link between data compression and truth finding: if our code happens to achieve a small code length using some approximation of the true distribution P* (i.e. the right-hand side of (20) is small), then the expected performance of the two-part estimator is very good (i.e. the left-hand side of (20) is small). For example, if P* has a finite code length L(P*) under the code L, then, since P̈ minimises the total two-part code length, the right-hand side of (20) is bounded by

n^{-1}\left[ L(\ddot{P}) - \log \ddot{P}(X^n) + \log P^*(X^n) \right] \le n^{-1}\left[ L(P^*) - \log P^*(X^n) + \log P^*(X^n) \right] = L(P^*)/n.

Thus the risk converges to 0 at least at rate 1/n, which, compared to other statistical estimators, is quite fast. Note the role of luckiness: the smaller the code length we assigned to (good approximations of) P*, the better the performance bound on two-part MDL. In many cases, the structure of H is such that we can design codes with good worst-case behaviour, even if H is uncountable. For example, if H is a parametric model and a two-part code with a clever discretisation scheme is used, then it can be shown using (20) that max_{P*∈H} R_n(P*, P̈) = O(1/n). This is the minimax optimal rate: in the worst case over all P* ∈ H, no statistical estimation method can have rate smaller than c/n for some c > 0. In the same way one can use (20) to show, for suitably chosen L, that two-part MDL achieves the minimax convergence rates for a range of nonparametric problems. Compare this to the classical maximum likelihood estimation method, which also achieves the minimax rate on parametric problems, but which may fail utterly on nonparametric problems. Theorem 8 is due (in slightly different versions) to A. Barron, J. Li and T. Zhang, and appeared first in Li's Ph.D. thesis [Li, 1999]. Its precise statement and implications are discussed at length in Chapter 15 of [Grünwald, 2007]. It is of fundamental importance, since it directly links compression and learning: the better we can compress data from P*, the faster we can identify a good approximation of P*.
4.2.2 Model Selection
We have described models as formal representations of hypotheses; we will now try to determine which of the hypotheses is "true", in the sense that the corresponding model contains the data generating distribution P*. The goal is then to identify this true model on the basis of as little data as possible. In the MDL approach to this problem, we use a two-part code on the outer level (since, based on data x^n, we want to output a particular model M, which therefore has to be encoded explicitly), and we can use any type of universal code at the inner level. Since the universal code for the true model achieves a code length not much larger than that of the best code in the model (it was designed to achieve small regret), and the best code in the model achieves a code length at most as large as that of the data generating distribution, it seems reasonable to expect that, as more data are gathered, a true model will eventually be selected by MDL.
This intuition is confirmed in the form of the following consistency result, which is one of the pillars of MDL model selection. The result applies to Bayesian model selection as well. We give a precise statement of the theorem, but also provide a more informal explanation for those unfamiliar with all the concepts involved.

THEOREM 9 Model Selection Consistency. Let H = {M_1, M_2, ...} be a countably infinite set of parametric models. For all i ∈ Z⁺, let M_i be 1-to-1 parameterised by Θ_i ⊆ R^k for some k ∈ N, and define a prior density w_i on Θ_i. Define a prior distribution W on the model indices. We require, for all integers j > i > 0, that with w_j-probability 1, a distribution drawn from M_j is mutually singular with all distributions in M_i. Define the MDL model selection criterion based on Bayesian universal codes to represent the models:

\delta(x^n) = \arg\min_i \left( -\log W(i) - \log \int_{\theta \in \Theta_i} P_\theta(x^n) \, w_i(\theta) \, d\theta \right).
Then for all δ* ∈ Z⁺ and all θ* ∈ Θ_{δ*}, except for a subset of Θ_{δ*} of Lebesgue measure 0, it holds for X_1, X_2, ... ∼ P_{θ*} that

P_{θ*}(∃n_0 ∀n ≥ n_0 : δ(X^n) = δ*) = 1.

Proof. Proofs of various versions of this theorem can be found in [Barron, 1985; Dawid, 1992a; Barron et al., 1998; Grünwald, 2007] and elsewhere.

Thus, the theorem can be applied to a sequence of models, as long as later models do not contain a significant number of distributions that also occur in earlier models. The δ-function in the theorem indicates which model is selected by MDL given a specific sequence of observations. The theorem then states, roughly, that from some sample size n_0 onwards, the true model will always be selected by MDL. The theorem uses a Bayesian code on the inner level (for the distributions within each model). However, as conjectured in [Grünwald, 2007], it can probably be extended to other universal codes on the inner level, such as NML (for some special cases, it was proven in [van Erven et al., 2008] that this is indeed the case). Note that (as we keep emphasising) we do not make any assumptions about the true distribution P*. However, if we are in the ideal situation where one of the models contains P*, then the theorem guarantees that this model will be selected once sufficient data have been accumulated. This property of model selection criteria is called consistency. Unfortunately, the theorem says nothing about the more realistic scenario where the models are merely approximations of the true data generating process. If the true data generating process P* is not an element of any of the models, we might still be interested in finding the best approximating model, for instance the model that contains the distribution P_{θ̃} minimising the Kullback-Leibler divergence D(P* ‖ P_{θ̃}). It is not known under what circumstances δ actually achieves this. On the other hand, some model
selection criteria that are often used in practice (such as maximum likelihood model selection, or AIC [Akaike, 1974]) are known not to be consistent, even in the restricted sense of Theorem 9. Therefore the consistency of MDL and Bayesian model selection is reassuring. At this point, we should add that MDL model selection, while consistent, does not always achieve the minimax optimal convergence rate, in the following sense. Suppose we first use MDL model selection to select a model M_k, and then, within the chosen model, we use a second estimator (which may again be a two-part code MDL estimator) to select a distribution P̈_θ ∈ M_k. Then, for some standard situations such as linear regression, using MDL model selection in the first stage, the Hellinger risk R_n(P*, P̈_θ) as defined in Section 4.2.1 will not converge to 0 at the fastest possible rate; in general, it is slower by a log n factor. The same holds for BIC and Bayes factor model selection: all these model selection methods are typically consistent, but do not achieve the optimal rate. In contrast, methods such as AIC and leave-one-out cross-validation (see Section 4.3 below) can be inconsistent, but do achieve the minimax rate. Quite recently, we have shown that with a different, more sophisticated code at the outer level of MDL selection (the so-called switch code), this problem can be overcome, and both consistency and optimal convergence rates are achieved; see Section 5.5.
4.3 Prediction
To use MDL for sequential prediction, we have to use codes that can be rendered as prequential forecasting systems. This includes any prequential universal code, in particular the Bayesian universal code and "predictive MDL" [Rissanen, 1984; Rissanen, 1986a]; see Section 3.2.4 for more information about prequential codes. There are several ways to evaluate the performance of a probabilistic prediction strategy P. In the theoretical machine learning literature [Cesa-Bianchi and Lugosi, 2006], it is customary to look at the accumulated loss, defined as

(21)   \sum_{i=1}^{n} \mathrm{LOSS}(x_i, P(X_i = \cdot \mid x^{i-1})).
Here LOSS can be any function mapping pairs of outcomes x ∈ X and probability distributions on X to the real numbers, possibly including infinity. A loss function with particularly useful theoretical properties is the logarithmic loss, LOSS(x, P) = −log P(x). With this loss function, the accumulated loss reduces to

(22)   \sum_{i=1}^{n} -\log P(X_i = x_i \mid x^{i-1}) = -\log \prod_{i=1}^{n} P(x_i \mid x^{i-1}) = -\log P(x^n),
which is just the code length of the sample! This binds compression and predictive performance together directly: the more we compress, the better the predictive performance in terms of one step ahead prediction loss. In the context of MDL,
let P represent a universal code relative to some set of hypotheses H. Then (22) implies that, if one of the distributions in H, used as a prediction strategy, predicts x^n well, and P is guaranteed to have small regret relative to H, then P is also guaranteed to predict x^n well. An important question is of course whether this result transfers from the logarithmic loss to the particular loss function that is of interest in the problem at hand. It turns out that if, as in Section 4.2.1, we are prepared to assume that one of the distributions in H is true, then (with some caveats) this is indeed the case; a more precise statement can be found in [Grünwald, 2007, Chapter 17].

But even if we stick to logarithmic loss, the simple identity (22) has several important consequences. First of all, it is the basis of a fundamental theorem, due to A. Barron [Barron, 1998] and closely related to Theorem 8, that gives explicit guarantees about the convergence rate of prediction strategies based on MDL prequential universal codes. Analogously to Theorem 8, the better a code compresses data sampled from P*, the faster it learns to predict outcomes from P* well. This is discussed in Chapter 15 of [Grünwald, 2007]. The theorem also has consequences for estimation, because we may also think of the predictions P(X_n = · | X^{n-1}) as an estimate of P* rather than as a predictor of X_n. In this way, each prequential universal code in fact defines an estimator relative to P* ∈ H; this provides an alternative, "prequential" rather than "two-part", MDL estimation method. Under this interpretation, Barron's theorem expresses something even more similar to Theorem 8: the better we can compress data from P* using a prequential code, the faster the estimator based on this code converges to P*.

Yet another consequence of (22) is that in some common situations, MDL model selection is a special case of Dawid's theory of prequential model validation [Dawid, 1984; Dawid, 1992b; Dawid, 1992a]. His weak prequential principle states that we should evaluate a prequential forecasting system based solely on the predictions it made for the data that were actually observed, not the predictions it might have made for other sequences of observations. This can be achieved by measuring the performance of a forecasting system in terms of its one step ahead prediction loss (21). Thus, suppose we want to choose between a finite number of models M_1, ..., M_k. According to the prequential idea, we should associate each of these models with a prediction strategy, and then pick the model whose associated prediction strategy achieves the smallest accumulated loss (21) on x_1, ..., x_n. The prediction strategies could, for example, be maximum likelihood plug-in codes as in the example of Section 3.2.4. Such an approach is similar to leave-one-out cross-validation in the sense that the models are evaluated using observed data only. The only difference is that in cross-validation the prediction of each x_i is based on both the previous outcomes x_1, ..., x_{i-1} and the future outcomes x_{i+1}, ..., x_n, whereas in prequential validation only the previous outcomes are used. As we can see from (22), if prequential validation is applied with the logarithmic loss function and prediction strategies P_1, ..., P_k associated with models M_1, ..., M_k respectively, then it is simply equivalent to MDL model selection between these models with a
uniform code (prior) on the outer level, and the codes based on P_1, ..., P_k on the inner level.
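A minimal sketch (ours) of prequential validation with logarithmic loss: each candidate model is represented by a forecasting system, and we select the one with the smallest accumulated loss, which by (22) is exactly MDL model selection with a uniform code on the outer level.

```python
import math

def prequential_bits(xs, predict):
    """Accumulated log loss of a forecasting system `predict`, which maps the
    past outcomes to a probability that the next outcome is 1."""
    bits = 0.0
    for i, x in enumerate(xs):
        p = predict(xs[:i])
        bits -= math.log2(p if x == 1 else 1 - p)
    return bits

# Two candidate models, each turned into a sequential prediction strategy.
fair_coin = lambda past: 0.5                                # a single code
bernoulli = lambda past: (sum(past) + 1) / (len(past) + 2)  # Bayes/Laplace

xs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
scores = {"fair coin": prequential_bits(xs, fair_coin),
          "Bernoulli": prequential_bits(xs, bernoulli)}
print(min(scores, key=scores.get), scores)   # pick the best compressor
```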
5 MDL IN PERSPECTIVE
In this section we link MDL to other approaches to learning, discussing both philosophical and technical differences and similarities. We also trace the development of the MDL principle back to its origin in algorithmic information theory and Kolmogorov complexity, and point out areas of recent development.
5.1 Basic Principles
The following advice summarises the basic principles of MDL as discussed above.

• Use codes to represent hypotheses about the data; measure the usefulness of a hypothesis by the achieved compression.

• Use a code that achieves low regret whatever data are observed, and carefully consider its luckiness properties. This applies to both the inner and the outer level for model selection.

• Design codes with performance guarantees that depend only on the model, but not on the data generating process given the model.2
5.2 Occam's Razor
When two models fit the data equally well, MDL will choose the one that is the "simplest" in the sense that it allows for a shorter description of the data. As such, it implements a precise form of Occam's razor, even though, as more and more data become available, the model selected by MDL may become more and more "complex"! Occam's razor is sometimes criticised for being either (1) arbitrary or (2) false [Webb, 1996; Domingos, 1999]. Do these criticisms apply to MDL as well?

1. "Occam's Razor (and MDL) Is Arbitrary". Because "description length" is a syntactic notion, it may seem that MDL selects an arbitrary model: different codes would have led to different description lengths and, therefore, to different models. By changing the encoding method, we can make 'complex' things 'simple' and vice versa. This criticism overlooks the fact that we are not allowed to use just any code we like! MDL prescribes, for each model Mk under consideration, the use of a universal code Lk that achieves small worst-case regret. This

2 This may seem to be at odds with our discussion of MDL consistency in Section 4.2: do we not assume that the true distribution P* is in one of the models there? However, the two statements "We do not know anything about P*" and "If P* is in one of the models, our method behaves especially nicely" are not contradictory.
prescription puts strong constraints on which models may be viewed as "simple" and which models may be viewed as "complex". First, we note that if the models Mk = {P_θ | θ ∈ Θ_k} are parametric and the parameter sets are suitably bounded, then we can associate with each model an inherent complexity, namely its worst-case regret (15). We already explained why it makes sense to call this quantity "parametric complexity". It does not depend on the chosen parameterisation, and is thus certainly not arbitrary. But in practice we do not want to restrict the parameter set, and then we have to use codes Lk with different regret on different sequences. However, it can be shown that, whatever code Lk we use, for all subsets Θ_0 of Θ, for "most" sequences x^n such that θ̂(x^n) ∈ Θ_0, the regret of Lk is larger than (16), which we repeat here for convenience:

(23)   \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta_0} \sqrt{\det I(\theta)} \, d\theta + o(1).
For a precise statement, see [Grünwald, 2007, Chapter 14]. Thus, the fraction of sequences for which Lk is "lucky" and achieves regret substantially smaller than (23) is vanishingly small. Therefore it still makes sense to call the regret of Lk the "complexity of model Mk relative to sequence x^n". For most sequences, this complexity will be equal to (k/2) log n plus some small constant.
EXAMPLE 10. Let us consider a prototypical model selection example: regression with polynomial models. The data are given by (x_1, y_1), ..., (x_n, y_n) with (x_i, y_i) ∈ R². The models are of the form Mk = {P_θ | θ ∈ R^{k+1}}, where P_θ ∈ Mk, θ = (θ_0, ..., θ_k), prescribes that Y = f_θ(X) + Z, where f_θ(x) = ∑_{j=0}^{k} θ_j x^j is a degree-k polynomial and Z is a normally distributed error (noise) term with mean 0. P_θ is extended to n outcomes by independence. For example, M1 represents the set of degree-1 polynomials, i.e. straight lines Y = aX + b, turned into probability distributions by adding a stochastic noise term. For now we restrict ourselves to model selection between M1 and M2. If we perform MDL model selection with these two models, we must associate each Mk with a universal code defined by its length function Lk, and we must also provide a universal code on the outer level with lengths L(k) for k ∈ {1, 2}. We may interpret the regret achieved by Lk relative to Mk for a given sequence as the "complexity" of Mk relative to that sequence. Now the sometimes-heard argument "MDL is arbitrary, since one could have used codes L1 and L2 such that M2 is always preferred over M1" is simply wrong. Since the code on the outer level should be universal and achieve small worst-case regret, we prefer a uniform code here. The code L2 we assign to M2 should be such that its regret is approximately (3/2) log n plus a small constant, since M2 has 3 free parameters. The regret of L1 relative to M1 should be approximately (2/2) log n = log n plus a small constant. As indicated by (23), codes such that, for most sequences, the regret relative to M2 is substantially smaller than (3/2) log n do not exist. Codes such that, for most sequences, the regret relative to M1 is substantially larger than (2/2) log n do exist, but they are highly wasteful, and the MDL principle does not allow us to use them. Hence, if proper MDL codes are used, then for most sequences the "complexity" associated
with M2 relative to that sequence, which we identified with the regret of L2 relative to that sequence, is substantially larger than the complexity associated with M1 relative to that sequence, the difference being about (1/2) log n. It can be shown that this implies that, if proper MDL codes are used, then there must be sequences on which MDL chooses M1 rather than M2 and vice versa. There are no “MDL codes” for which M1 is always preferred over M2 or vice versa. ♦
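To make Example 10 concrete, here is a rough sketch (ours). It scores each polynomial degree by minus the maximised log-likelihood plus the asymptotic (k/2) log n penalty, with k the number of coefficients; this is only a crude stand-in for the regret of a proper universal code, ignoring the constant g(θ̂) in (11), but it shows the mechanics of trading fit against complexity.

```python
import numpy as np

def mdl_score(x, y, degree):
    """Two-part-style score in nats: minus the maximised Gaussian
    log-likelihood plus (k/2) log n, with k = degree + 1 coefficients."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    sigma2 = np.mean((y - np.polyval(coeffs, x)) ** 2)  # ML noise variance
    neg_loglik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return neg_loglik + 0.5 * (degree + 1) * np.log(n)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 2 * x - 1 + 0.3 * rng.standard_normal(100)   # a "line plus noise" source
print(min([1, 2, 3, 4], key=lambda d: mdl_score(x, y, d)))  # typically: 1
```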
It is true that MDL leaves room for subjective input in the form of the luckiness principle. The statistician has the freedom to choose some special set A of hypotheses, or parameters within a model, and make them especially easy to learn. Of course, it may occur that these lucky few turn out not to improve compression performance, while giving another set B special treatment would have improved performance by a lot. This may seem arbitrary, but consider that the alternative is to make everything hard to learn. Also consider that subjective is not the same as arbitrary: there may be good reason to choose A rather than B as the lucky set, and if there is not, then we can even consider the union A ∪ B as a potential lucky set, at the modest price of slightly increasing the code length for sequences with maximum likelihood parameters in A. It is completely impossible, however, to make everything lucky, so there are hard constraints on the freedom of the statistician. To illustrate, we turn back to the regression example for polynomials of degree one or two. Given enough data, we should be able to estimate the parameters of any second-degree polynomial up to any desired precision, but this will occur a lot faster if the parameters happen to be in the special subset M1, which can thus be viewed as a lucky set. We could also have considered an alternative lucky set M′1, containing all polynomials with Y_i = aX_i + b + 13X_i². The choice between M1 and M′1 is subjective (not arbitrary), but whichever we pick, we can only make a code such that M1 is really lucky if M1 is a very small fraction of M2, for which the worst-case regret is set in stone. Here "really lucky" means that the regret of sequences with maximum likelihood distributions close to or in M1 is significantly smaller than the regret of the remaining sequences.

2. "Occam's Razor Is False". It is often claimed that Occam's razor is false: we often try to model real-world situations that are arbitrarily complex, so why should we favour simple models? In the words of Webb [Webb, 1996], "What good are simple models of a complex world?"3 The short answer is: even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL (and the corresponding form of Occam's razor) is a strategy for inferring models from data ("choose simple models at small sample sizes"), not a statement about how the world works ("simple models are more likely to be true"); indeed, a strategy cannot be true or false, it is "clever" or "stupid". And the strategy of preferring simpler models is clever even if the data-generating process is highly complex, as illustrated by the following example.
3 Quoted with permission from KDD Nuggets 96,28, 1996.
EXAMPLE 11 Simple Models of Complex Sources. We return to the polynomial regression setting of the previous example, but now we suppose that the data are sampled from a very complex source, say, a polynomial of degree 1,000. Then, if we use two-part code MDL to learn a polynomial for data ((x_1, y_1), ..., (x_n, y_n)), the model Mk selected by MDL at sample size n will have k < 1,000 for small n, but k will typically increase with n, and almost surely it will settle on M1000 for all n larger than some n_0 (by the consistency Theorem 9). Now, one could claim that MDL infers the "wrong" degree polynomial for n < n_0. Following this line of reasoning, suppose that we somehow knew the "right" degree, and redefined our method to report: the degree is 1,000. If we then tried to fit this model for small n, say n < 1000, we would have far too few data to accurately estimate all 1001 parameters, which could well lead to disastrous predictions of future outcomes from the same source! At the other extreme, a low-degree polynomial, say a parabola, may not be able to fit the data as well, but it has so few parameters that 1,000 observations suffice to estimate the optimal parameter values for this simple approximation quite accurately. Therefore, the mean squared error achieved on the available data should be a good indication of the squared error to be expected on future data. So even though the model is "wrong" in the sense that it is much simpler than the true source, it yields much more reliable predictions. MDL implements a tradeoff between these two extremes: given enough data, the complex truth will be discovered, but as long as there are insufficient data to do so, MDL will report a simple, but useful, approximation. The source we considered so far was complex, but at least it was in one of the considered models. It is far from uncommon, however, that the source is not in any of the models. The same reasoning applies in this case: a polynomial of sufficiently low degree will give a correct impression of how well it will predict future data, even if the data are not polynomial and none of the considered models is "true"! ♦

The ideas in this example appear not only in the MDL literature, but are also the basis of Vapnik's [Vapnik, 1998] structural risk minimisation approach and many standard statistical methods for nonparametric inference. In such approaches one acknowledges that the data-generating machinery can be very complex, and might not even be consistent with the considered models. Nevertheless, it is still a good strategy to approximate it by simple hypotheses (low-degree polynomials) as long as the sample size is small.

Summarising: The Inherent Difference between Under- and Overfitting. (A) If we choose an overly simple (small) model for our data, then the best-fitting point hypothesis within the model is likely to be almost the best predictor, within the simple model, of future data coming from the same source. On the other hand, (B), if we overfit (choose a very complex, i.e. large, model) and there is noise in our data, then, even if the complex model contains the "true" point hypothesis, the best-fitting point hypothesis within the model is likely to lead to very bad predictions of future data coming from the same source. This statement is imprecise and is meant to convey the general idea, but it becomes provably true if we use MDL's measure
of model complexity, we measure prediction quality by logarithmic loss, and we assume that one of the distributions in H actually generates the data. In fact, statement (A) is an implication of Barron's theorem about MDL prediction that we briefly referred to in Section 4.3. The following simulation sketch illustrates the contrast between (A) and (B).
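(This sketch is ours; the source function, sample sizes and degrees are arbitrary illustrations.) An underfitted low-degree polynomial reports an honest training error, while an overfitted high-degree polynomial achieves a tiny training error but predicts future data from the same source badly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """A fairly complex source: a wiggly function plus Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(12 * x) + x ** 3 + 0.2 * rng.standard_normal(n)

x_tr, y_tr = sample(30)      # small training sample
x_te, y_te = sample(1000)    # fresh data from the same source

for degree in (2, 15):       # underfit vs. overfit (polyfit may warn)
    c = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(c, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(c, x_te)) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
# Degree 2: train and test error are similar (statement A). Degree 15:
# near-zero train error, much larger test error (statement B).
```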
5.3 MDL, Bayesian Inference and Frequentist Statistics
The MDL, Bayesian and frequentist schools of thought differ in their interpretation of how the concept of probability relates to the real world. The frequentist school holds that probability can only express something about the real world in the context of a repeatable experiment. The frequency of a particular observation converges as more observations are gathered; this limiting value is then called the probability. This interpretation is too restrictive for many applications: for example, the probability that a suspect is guilty (as often required in legal reasoning), or the probability that it will rain tomorrow, are not frequentist probabilities, because no repeatable experiments are involved. According to the subjective Bayesian school, on the other hand, a probability expresses a degree of belief in a certain proposition, which can make sense even outside the context of a repeatable experiment. However, this interpretation is also problematic, because in practice people of necessity often work with very crude models, which everybody agrees have no real truth to them. In such cases, probabilities are often used to represent "beliefs" which one knows a priori to be false! (In order to avoid offending people, we have to point out here that there are actually many different Bayesian schools of thought, and many practitioners would probably hesitate to subscribe completely to the subjectivist interpretation.) For further discussion, see e.g. [Savage, 1954; Berger, 1985]. In stark contrast, from an MDL point of view a probability distribution is really just another word for a code, which allows us to avoid this philosophical can of worms altogether. We do not think of codes in terms of "truth" or "belief"; instead, a code is judged by its efficiency when it is applied in practice: a good code achieves a short code length on the data that we observe, whether or not it is based on valid assumptions about the truth.

In spite of these quite profound differences in interpretation, technically MDL and Bayesian inference are very closely related. In fact, because Bayesian codes are usually mathematically elegant and, with a suitable choice of prior, can be made to achieve low regret in the worst case, they more often than not turn out to be the code of preference in MDL settings. (There is then a close correspondence between "luckiness functions" and prior distributions, but as explained in [Grünwald, 2007, Chapter 17], some differences remain.) However, MDL research sometimes employs non-Bayesian codes, while Bayesians sometimes use priors which do not guarantee low regret. In fact, there are some known inconsistency problems with Bayesian inference in nonparametric settings [Diaconis and Freedman, 1986]. These problems are invariably due to the use of priors which achieve larger regret than necessary. By adopting nonparametric priors with small regret, as
prescribed by the MDL principle, the problems disappear. This is a direct consequence of results such as Theorem 8, which show that small coding regret implies fast learning.
5.4 A Brief History of MDL
The practical MDL principle that we discuss in this chapter has mainly been developed by Rissanen in a series of papers starting in 1978 with [Rissanen, 1978]. It has its roots in the theory of Kolmogorov complexity [Li and Vitányi, 2008], developed in the 1960s by Solomonoff [Solomonoff, 1964], Kolmogorov [Kolmogorov, 1965] and Chaitin [Chaitin, 1966; Chaitin, 1969]. Among these authors, Solomonoff (a former student of the famous philosopher of science, Rudolf Carnap) was explicitly interested in inductive inference. His 1964 paper contains explicit suggestions on how the underlying ideas could be made practical, thereby foreshadowing some of the later work on two-part MDL. While Rissanen was not aware of Solomonoff's work at the time, Kolmogorov's 1965 paper [Kolmogorov, 1965] did serve as an inspiration for Rissanen's first development of MDL [Rissanen, 1978]. Another important inspiration for Rissanen was Akaike's AIC method for model selection (Section 4.2), essentially the first model selection method based on information-theoretic ideas [Akaike, 1973; Akaike, 1974]. Even though Rissanen was inspired by AIC, both the actual method and the underlying philosophy are substantially different from MDL.

Minimum Message Length. MDL is much more closely related to the Minimum Message Length (MML) Principle [Wallace, 2005], developed by Wallace and his coworkers in a series of papers starting in 1968 with the groundbreaking [Wallace and Boulton, 1968]; other milestones are [Wallace and Boulton, 1975] and [Wallace and Freeman, 1987]. Remarkably, Wallace developed his ideas without being aware of the notion of Kolmogorov complexity. Although Rissanen became aware of Wallace's work before the publication of the first MDL paper [Rissanen, 1978], he developed his ideas mostly independently, being influenced more by Akaike and Kolmogorov. Indeed, despite the close resemblance of both methods in practice, the underlying philosophy is very different. A 10-page discussion of the precise relationship between MML and MDL, both technically and philosophically, can be found in Chapter 17 of [Grünwald, 2007].

The first publications on MDL only mention two-part codes. Important progress was made in 1984, when Rissanen used prequential codes for the first time [Rissanen, 1984], and again in 1987, when Rissanen introduced Bayesian mixture codes in MDL [Rissanen, 1987]. This led to the development of the notion of stochastic complexity as the shortest code length of the data given a model [Rissanen, 1986b; Rissanen, 1987]. However, the full development of the notion of "parametric complexity" (15) had to wait until 1996, when (again) Rissanen made the connection to Shtarkov's normalised maximum likelihood code [Rissanen, 1996]. In the meantime, A. Barron [Barron, 1985] showed in his impressive Ph.D. thesis how a specific
version of the two-part code criterion has excellent frequentist statistical consistency properties. This was extended in 1991 by Barron and Cover [Barron and Cover, 1991], who achieved a breakthrough for two-part codes: they gave clear prescriptions on how to design codes for hypotheses, relating codes with good minimax code length properties to rates of convergence in statistical consistency theorems, both for parametric and nonparametric problems. Some of the ideas of Rissanen's 1987 paper and Barron and Cover's 1991 paper were, as it were, unified when in 1996 Rissanen [Rissanen, 1996] introduced the normalised maximum likelihood code. The resulting theory was summarised for the first time by Barron, Rissanen and Yu [Barron et al., 1998], and is the main subject of the first (2007) comprehensive overview of the field, [Grünwald, 2007]. In that book, the modern versions of MDL, where code design is based on explicit rather than ad hoc goals, are called "refined MDL".
5.5 Recent Developments
While this text is intended to be introductory rather than exhaustive, there have been a couple of recent developments in MDL research that shed new light on its interpretation, and which we consider relevant for anyone who currently uses or plans to use MDL or Bayesian inference in practice.

Luckiness Functions. Although we emphasised the importance of luckiness in the design of the code for hypotheses in Section 2, it did not really play a part in the definition of the universal codes later in Section 3.2. This is for historical reasons: it was not known until recently (nor was it considered important) how universal codes could be constructed with both low worst-case regret and a freely chosen luckiness structure. In [Grünwald, 2007] it is shown how this can be achieved using luckiness functions. A luckiness function a : Θ → R roughly expresses which parameter values in Θ are especially easy to learn, should they provide a good fit. While luckiness-related concepts were always implicit in MDL-like inference, the idea of using a luckiness function seems due to Barron [Barron, 1998], although he did not use that name. The luckiness terminology was first adopted by Grünwald [Grünwald, 2007], who imported it from the computational learning theory literature [Shawe-Taylor et al., 1998], where it plays a somewhat similar role.

Model Switching. The outer-level codes that are used in model selection problems are designed to achieve low worst-case regret, where regret is normally measured against the "best" model, i.e., the one that allows for the most compression. However, in predictive settings, practitioners often find that there is no single best model. Instead, simpler models tend to predict well as long as the sample size is small, while more complex models predict better as more data are gathered. This so-called catch-up phenomenon leads to the idea of measuring regret not against the model with the best overall predictive performance, but against a sequence of
models, where each model in the sequence is used in turn to predict a number of outcomes. There are efficient and simple algorithms to achieve this, which often lead to significant improvements in model selection performance, both in theory and in practice. For example, in many situations, one can achieve both consistency and optimal convergence rates; see Section 4.2.2. For more details, please refer to [van Erven et al., 2008].

BIBLIOGRAPHY

[Akaike, 1973] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, Second International Symposium on Information Theory, pages 267–281, Budapest, 1973. Akademiai Kiado.
[Akaike, 1974] H. Akaike. A new look at statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[Barron and Cover, 1991] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034–1054, 1991.
[Barron et al., 1998] A. Barron, J. Rissanen, and B. Yu. The Minimum Description Length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
[Barron, 1985] A.R. Barron. Logically Smooth Density Estimation. PhD thesis, Dept. of Electrical Engineering, Stanford University, Stanford, CA, 1985.
[Barron, 1998] A.R. Barron. Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6, pages 27–52. Oxford University Press, 1998.
[Berger, 1985] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer-Verlag, New York, revised and expanded second edition, 1985.
[Cesa-Bianchi and Lugosi, 2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, Cambridge, UK, 2006.
[Chaitin, 1966] G.J. Chaitin. On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547–569, 1966.
[Chaitin, 1969] G.J. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16:145–159, 1969.
[Chomsky, 1956] N. Chomsky. Three models for the description of language. IEEE Transactions on Information Theory, 2(3), September 1956.
[Cover and Thomas, 1991] T.M. Cover and J.A. Thomas. Elements of Information Theory. Series in telecommunications. John Wiley, 1991.
[Dawid, 1984] A.P. Dawid. Present position and potential developments: Some personal views, statistical theory, the prequential approach. Journal of the Royal Statistical Society, Series A, 147(2):278–292, 1984.
[Dawid, 1992a] A.P. Dawid. Prequential analysis, stochastic complexity and Bayesian inference. In J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, editors, Bayesian Statistics 4, pages 109–125. Oxford University Press, 1992.
[Dawid, 1992b] A.P. Dawid. Prequential data analysis. In M. Ghosh and P.K. Pathak, editors, Current Issues in Statistical Inference: Essays in Honor of D. Basu, Lecture Notes–Monograph Series, pages 113–126. Institute of Mathematical Statistics, 1992.
[de Rooij and Grünwald, 2006] Steven de Rooij and Peter Grünwald. An empirical study of minimum description length model selection with infinite parametric complexity. Journal of Mathematical Psychology, 50:180–190, 2006.
[Diaconis and Freedman, 1986] P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, 14(1):1–26, 1986.
[Domingos, 1999] P. Domingos. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery, 3(4):409–425, 1999.
[Grünwald and Vitányi, 2008] P.D. Grünwald and P.M.B. Vitányi. Algorithmic information theory. In P. Adriaans and J. van Benthem, editors, Handbook of the Philosophy of Science, volume 8: Philosophy of Information, pages 289–325. Elsevier Science, 2008.
[Grünwald, 2007] P.D. Grünwald. The Minimum Description Length Principle. MIT Press, June 2007. Chapter 17, containing further discussion of philosophical and conceptual issues, can be freely downloaded from www.grunwald.nl.
[Kolmogorov, 1965] A.N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
[Kontkanen and Myllymäki, 2007] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103:227–233, September 2007.
[Li, 1999] J.Q. Li. Estimation of Mixture Models. PhD thesis, Yale University, New Haven, CT, 1999.
[Li and Vitányi, 2008] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, 3rd edition, 2008.
[Rissanen and Roos, 2007] J. Rissanen and T. Roos. Conditional NML universal models. In 2007 Information Theory and Applications Workshop (ITA-07), pages 337–341, 2007.
[Rissanen, 1976] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3), 1976.
[Rissanen, 1978] J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.
[Rissanen, 1983a] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, IT-29(5):656–664, 1983.
[Rissanen, 1983b] J. Rissanen. A universal prior for integers and estimation by Minimum Description Length. Annals of Statistics, 11:416–431, 1983.
[Rissanen, 1984] J. Rissanen. Universal coding, information, prediction and estimation. IEEE Transactions on Information Theory, 30:629–636, 1984.
[Rissanen, 1986a] J. Rissanen. A predictive least squares principle. IMA Journal of Mathematical Control and Information, 3:211–222, 1986.
[Rissanen, 1986b] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080–1100, 1986.
[Rissanen, 1987] J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society, Series B, 49:223–239, 1987. Discussion: 252–265.
[Rissanen, 1989] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[Rissanen, 1996] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.
[Savage, 1954] L.J. Savage. The Foundations of Statistics. Dover Publications, 1954.
[Shannon, 1948] C.E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379–423, 623–656, 1948.
[Shawe-Taylor et al., 1998] J. Shawe-Taylor, P. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimisation over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
[Solomonoff, 1964] R.J. Solomonoff. A formal theory of inductive inference, part 1 and part 2. Information and Control, 7:1–22, 224–254, 1964.
[Tromp, 2007] J. Tromp. Binary lambda calculus and combinatory logic. Available at http://www.cwi.nl/~tromp/cl/cl.html, 2007.
[van Erven et al., 2008] T. van Erven, P.D. Grünwald, and S. de Rooij. Catching up faster by switching sooner: a prequential solution to the AIC-BIC dilemma. Available at http://arxiv.org/abs/0807.1005, 2008. Submitted to Journal of the Royal Statistical Society, Series B. A much shorter version appears in the proceedings of NIPS 2008.
[Vapnik, 1998] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[Wallace and Boulton, 1968] C.S. Wallace and D.M. Boulton. An information measure for classification. Computer Journal, 11:185–195, 1968.
[Wallace and Boulton, 1975] C.S. Wallace and D.M. Boulton. An invariant Bayes method for point estimation. Classification Society Bulletin, 3(3):11–34, 1975.
[Wallace and Freeman, 1987] C.S. Wallace and P.R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society B, 49:240–251, 1987. Discussion: pages 252–265.
[Wallace, 2005] C.S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer-Verlag, New York, 2005.
[Webb, 1996] G.I. Webb. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397–417, 1996.
[Willems et al., 1997] F.M.J. Willems, Y.M. Shtarkov, and Tj.J. Tjalkens. Reflections on 'The Context-Tree Weighting Method: Basic properties'. Newsletter of the IEEE Information Theory Society, 1997.
MML, HYBRID BAYESIAN NETWORK GRAPHICAL MODELS, STATISTICAL CONSISTENCY, INVARIANCE AND UNIQUENESS

David L. Dowe

1 INTRODUCTION
The problem of statistical — or inductive — inference pervades a large number of human activities and a large number of (human and non-human) actions requiring 'intelligence'. Human and other 'intelligent' activity often entails making inductive inferences, remembering and recording observations from which one can make inductive inferences, learning (or being taught) the inductive inferences of others, and acting upon these inductive inferences. The Minimum Message Length (MML) approach to machine learning (within artificial intelligence) and statistical (or inductive) inference gives us a trade-off between simplicity of hypothesis (H) and goodness of fit to the data (D) [Wallace and Boulton, 1968, p. 185, sec 2; Boulton and Wallace, 1969; 1970, p. 64, col 1; Boulton, 1970; Boulton and Wallace, 1973b, sec. 1, col. 1; 1973c; 1975, sec 1 col 1; Wallace and Boulton, 1975, sec. 3; Boulton, 1975; Wallace and Georgeff, 1983; Wallace and Freeman, 1987; Wallace and Dowe, 1999a; Wallace, 2005; Comley and Dowe, 2005, secs. 11.1 and 11.4.1; Dowe, 2008a, sec 0.2.4, p. 535, col. 1 and elsewhere]. There are several different and intuitively appealing ways of thinking of MML. One such way is to note that files with structure compress (if our file compression program is able to find said structure) and that files without structure don't compress. The more structure (that the compression program can find), the more the file will compress. Another, second, way to think of MML is in terms of Bayesian probability, where Pr(H) is the prior probability of a hypothesis, Pr(D|H) is the (statistical) likelihood of the data D given hypothesis H, − log Pr(D|H) is the (negative) log-likelihood, Pr(H|D) is the posterior probability of H given D, and Pr(D) is the marginal probability of D — i.e., the probability that D will be generated (regardless of whatever the hypothesis might have been). Applying Bayes's theorem twice, with or without the help of a Venn diagram, we have Pr(H|D) = Pr(H&D)/Pr(D) = (1/Pr(D)) Pr(H) Pr(D|H). Choosing the most probable hypothesis (a posteriori) is choosing H so as to maximise Pr(H|D). Given that Pr(D) and 1/Pr(D) are independent of the choice of
hypothesis H, this is equivalent to choosing H to maximise Pr(H) · Pr(D|H). By the monotonicity of the logarithm function, this is in turn equivalent to choosing H so as to minimise − log Pr(H) − log Pr(D|H). From Shannon's information theory (see sec. 2.1), this is the amount of information required to encode H (in the first part of a two-part message) and then encode D given H (in the second part of the message). And this is, in turn, similar to our first way above of thinking about MML, where we seek H so as to give the optimal two-part file compression. We have shown that, given data D, we can variously think of the MML hypothesis H in at least two different ways: (a) as the hypothesis of highest posterior probability and also (b) as the hypothesis giving the two-part message of minimum length for encoding H followed by D given H; and hence the name Minimum Message Length (MML). Historically, the seminal Wallace and Boulton paper [1968] came into being from Wallace's and Boulton's finding that the Bayesian position that Wallace advocated and the information-theoretic (conciseness) position that Boulton advocated turned out to be equivalent [Wallace, 2005, preface, p. v; Dowe, 2008a, sec. 0.3, p. 546 and footnote 213]. After several more MML writings [Boulton and Wallace, 1969; 1970, p. 64, col. 1; Boulton, 1970; Boulton and Wallace, 1973b, sec. 1, col. 1; 1973c; 1975, sec. 1, col. 1] (and an application paper [Pilowsky et al., 1969], and at about the same time as David Boulton's PhD thesis [Boulton, 1975]), their paper [Wallace and Boulton, 1975, sec. 3] again emphasises the equivalence of the probabilistic and information-theoretic approaches. (And all of this work on Minimum Message Length (MML) occurred prior to the later Minimum Description Length (MDL) principle discussed in sec. 6.7 and first published in 1978 [Rissanen, 1978].) A third way to think about MML is in terms of algorithmic information theory (or Kolmogorov complexity): the shortest input to a (Universal) Turing Machine [(U)TM] or computer program which will yield the original data string, D. This relationship between MML and Kolmogorov complexity is formally described — alongside the other two ways above of thinking of MML (probability on the one hand and information theory or concise representation on the other) — in [Wallace and Dowe, 1999a]. In short, the first part of the message encodes H and causes the TM or computer program to read (without yet writing) and prepare to output data as though it were generated from this hypothesis. The second part of the input then causes the (resultant emulation) program to write the data, D. So, in sum, there are (at least) three equivalent ways of regarding the MML hypothesis. It variously gives us: (i) the best two-part compression (thus best capturing the structure), (ii) the most probable hypothesis (a posteriori, after we've seen the data), and (iii) an optimal trade-off between structural complexity and noise — with the first part of the message capturing all of the structure (no more, no less) and the second part of the message then encoding the noise. Theorems from [Barron and Cover, 1991] and arguments from [Wallace and Freeman, 1987, p. 241] and [Wallace, 2005, chap. 3.4.5, pp. 190-191] attest to the
general optimality of this two-part MML inference — converging to the correct answer as efficiently as possible. This result appears to generalise to the case of model misspecification, where the model generating the data (if there is one) is not in the family of models that we are considering [Grünwald and Langford, 2007, sec. 7.1.5; Dowe, 2008a, sec. 0.2.5]. In practice, we find that MML is quite conservative in variable selection, typically choosing less complex models than rival methods [Wallace, 1997; Fitzgibbon et al., 2004; Dowe, 2008a, footnote 153, footnote 55 and near end of footnote 135] while also typically appearing to be better predictively. Having introduced Minimum Message Length (MML), throughout the rest of this chapter we proceed initially as follows. First, we introduce information theory, Turing machines and algorithmic information theory — and we relate all of those to MML. We then move on to Ockham's razor and the distinction between inference (or induction, or explanation) and prediction. We then continue on to relate MML and its relevance to a myriad of other issues.
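As a toy illustration of choosing H to minimise − log Pr(H) − log Pr(D|H), the following sketch (our own example, not drawn from the works cited above; the grid, prior and function names are ours) picks the MML estimate of a coin's bias from a small discrete grid of hypotheses under a uniform prior.

```python
# Two-part message length for a coin-bias hypothesis:
#   msglen(H) = -log2 Pr(H) - log2 Pr(D | H)
import math

def mml_bias(heads, tails, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    prior = 1.0 / len(grid)                      # uniform prior over the grid
    best = None
    for p in grid:
        part1 = -math.log2(prior)                # first part: encode H
        part2 = -(heads * math.log2(p) + tails * math.log2(1 - p))  # D given H
        if best is None or part1 + part2 < best[1]:
            best = (p, part1 + part2)
    return best

p_hat, length = mml_bias(heads=8, tails=2)
print(p_hat, round(length, 2))  # 0.7 wins: good fit at the same prior cost
```

With a richer (non-uniform) prior, the first-part lengths would differ across hypotheses, making the simplicity/fit trade-off explicit.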
2 INFORMATION THEORY — AND VARIETIES THEREOF

2.1 Elementary information theory and Huffman codes
Tossing a fair unbiased coin n times has 2^n equiprobable outcomes of probability 2^(−n) each. So, intuitively, it requires n bits (or binary digits) of information to encode an event of probability 2^(−n), so (letting p = 2^(−n)) an event of probability p contains − log_2 p bits of information. This result holds more generally for bases k = 3, 4, ... other than 2. The Huffman code construction (for base k), described in (e.g.) [Wallace, 2005, chap. 2, especially sec. 2.1; Dowe, 2008b, p. 448] and below, ensures that the code length li for an event ei of probability pi satisfies − log_k pi ≈ li < − log_k pi + 1. Huffman code construction proceeds by taking m events e1, . . . , em of probability p1, . . . , pm respectively and building a code tree by successively (iteratively) joining together the events of least probability. So, with k = 2, the binary code construction proceeds by joining together the two events of least probability (say ei and ej) and making a new event ei,j of probability pi,j = pi + pj. (For a k-ary code construction of arity k, we join the k least probable events together — see, e.g., fig. 3, with arity k = 3. We address this point a little more later.) Having joined two events into one event, there is now 1 less event left. This iterates one step at a time until the tree is reduced to its root. An example with k = 2 from [Dowe, 2008b, p. 448, Fig. 1] is given in Figure 1. Of course, we can not always expect all probabilities to be of the form k^(−n), as they are in the friendly introductory example of fig. 1. One example with k = 2 (binary) and where the probabilities are not all some k raised to the power of a negative (or zero) integer is 1/21, 2/21, 3/21, 4/21, 5/21, 6/21, as per fig. 2, which we now examine. Immediately below, we step through the stages of the binary Huffman code construction in fig. 2. The two events of smallest probability are e1 and e2 of
event   probability   codeword
e1      1/2           0
e2      1/4           10
e3      1/16          1100
e4      1/16          1101
e5      1/8           111

Figure 1. A simple (binary) Huffman code tree with k = 2 (tree diagram not reproduced; only the resulting code is shown)
event   probability   codeword
e1      1/21          0000
e2      2/21          0001
e3      3/21          001
e6      6/21          01
e4      4/21          10
e5      5/21          11

Figure 2. A not so simple (binary) Huffman code tree with k = 2 (tree diagram not reproduced; only the resulting code is shown)
probabilities 1/21 and 2/21 respectively, so we join them together to form e1,2 of probability 3/21. The two remaining events of least probability are now e1,2 and e3, so we join them together to give e1,2,3 of probability 6/21. The two remaining events of least probability are now e4 and e5, so we join them together to give e4,5 of probability 9/21. Three events now remain: e1,2,3, e4,5 and e6. The two smallest probabilities are p1,2,3 = 6/21 and p6 = 6/21, so they are joined to give e1,2,3,6 with probability p1,2,3,6 = 12/21. For the final step, we then join e4,5 and e1,2,3,6. The code-words for the individual events are obtained by tracing a path from the root of the tree (at the right of the code-tree) left across to the relevant event at the leaf of the tree. For a binary tree (k = 2), every up branch is a 0 and every down branch is a 1. The final code-words are e1: 0000, e2: 0001, e3: 001, etc. (For the reader curious as to why we re-ordered the events ei, putting e6 in the middle and not at an end: if we had not done this, then some of the arcs of the Huffman code tree would cross — probably resulting in a less elegant and less clear figure.) For another such example with k = 2 (binary) and where the probabilities are not all some k raised to the power of a negative (or zero) integer, see the example (with probabilities 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 7/36, 8/36) from [Wallace, 2005, sec. 2.1.4, Figs. 2.5–2.6]. An example with k = 3 is given in Fig. 3. The Huffman code construction for k = 3 is very similar to that for k = 2, but it also sometimes has something of a small pre-processing step. Each step of the k-ary Huffman construction involves joining k events together, thus reducing the number of events by (k − 1), which is equal to 2 for k = 3. So, if the number of events is even, our initial pre-processing step is to join the two least probable events together. That done, we now have an odd number of events and our code tree construction is, at each step, to join the three least probable remaining events together. We continue this until just one event is left, at which point we have reached the root and have finished. The assignment of code-words is similar to the binary case, although the ternary construction (with k = 3) has 3-way branches. The top branch is 0, the middle branch is 1 and the bottom branch is 2. The reader is invited to construct and verify this code construction example (in fig. 3) and the earlier examples referred to above. For higher values of k, the code construction joins k events (or nodes) into 1 node each time, reducing the number of nodes by (k − 1) at each step. If the number of nodes is q(k − 1) + 1 for some q ≥ 0, then the Huffman code construction does not need a pre-processing step. Otherwise, if the number of nodes is q(k − 1) + 1 + r for some q ≥ 0 and some r such that 1 ≤ r ≤ k − 2, then we require a pre-processing step of first joining together the (r + 1) least probable events into one, reducing the number of nodes by r to q(k − 1) + 1. The result mentioned earlier was that

    − log_k pi ≈ li < − log_k pi + 1.    (1)
event   probability   codeword
e1      1/9           00
e2      1/9           01
e3      1/27          020
e4      1/27          021
e5      1/27          022
e6      1/3           1
e7      1/9           20
e8      1/9           21
e9      1/9           22

Figure 3. A simple (ternary) Huffman code tree with k = 3 (tree diagram not reproduced; only the resulting code is shown)
This bound, (1), follows from the Huffman construction. It is customary to make the approximation that li = − log_k pi. Because of the relationship between different bases of logarithm a and b that, for all x > 0, log_b x = (log_a x)/(log_a b) = (log_b a) log_a x, changing base of logarithms has the effect of scaling the logarithms by a multiplicative constant, log_b a. As such, the choice of base of logarithm is somewhat arbitrary. The two most common bases are 2 and e. When the base is 2, the information content is said to be in bits. When the base is e, the information content is said to be in nits [Boulton and Wallace, 1970, p. 63; Wallace, 2005, sec. 2.1.8; Comley and Dowe, 2005, sec. 11.4.1; Dowe, 2008a, sec. 0.2.3, p. 531, col. 1], a term which I understand to have had its early origins in (thermal) physics. Alternative names for the nit include the natural ban (used by Alan Turing (1912-1954) [Hodges, 1983, pp. 196-197]) and (much much later) the nat.
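For readers who would like to verify the above examples mechanically, here is one way the binary (k = 2) Huffman construction described above might be written (a sketch of ours; the k-ary case with its pre-processing step is analogous, and the exact codewords can differ from the figures by tie-breaking, though the code lengths agree).

```python
# Binary Huffman construction via a heap of (probability, tree) pairs.
import heapq
from fractions import Fraction
from itertools import count

def huffman(probs):
    tick = count()  # tie-breaker so the heap never compares trees directly
    heap = [(p, next(tick), name) for name, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # the two least probable events
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")     # up branch is a 0
            walk(node[1], prefix + "1")     # down branch is a 1
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

probs = {f"e{i}": Fraction(i, 21) for i in range(1, 7)}  # fig. 2's probabilities
print(huffman(probs))  # code lengths 4, 4, 3, 2, 2, 2, matching fig. 2
```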
2.2 Prefix codes and Kraft's inequality
Furthermore, defining a prefix code to be a set of (k-ary) strings (of arity k, i.e., where the available alphabet from which each symbol in the string can be selected is of size k) such that no string is the prefix of any other, we note that the 2^n binary strings of length n form a prefix code. Elaborating, neither of the 2^1 = 2 binary strings 0 and 1 is a prefix of the other, so the set of code words {0, 1} forms a prefix code. Again, none of the 2^2 = 4 binary strings 00, 01, 10 and 11 is a prefix of any of the others, so the set of code words {00, 01, 10, 11} forms a prefix code. Similarly, for k ≥ 2 and n ≥ 1, the k^n k-ary strings of length n likewise form a prefix code. We also observe that the fact that the Huffman code construction leads to a (Huffman) tree means that the result of any Huffman code construction is always a prefix code. (Recall the examples from sec. 2.1.) Prefix codes are also known as (or, perhaps more technically, are equivalent to) instantaneous codes. In a prefix code, as soon as we see a code-word, we instantaneously recognise it as an intended part of the message — because, by the nature of prefix codes, this code-word can not be the prefix of anything else. Non-prefix (and therefore non-instantaneous) codes do exist, such as (e.g.) {0, 01, 11}. For a string of the form 01^n, we need to wait to the end of the string to find out what n is and whether n is odd or even before we can decode this in terms of our non-prefix code. (E.g., 011 is 0 followed by 11, 0111 is 01 followed by 11, etc.) For the purposes of the remainder of our writings here, though, we can safely and do restrict ourselves to (instantaneous) prefix codes. A result often attributed to Kraft [1949] but which is believed by many to have been known to at least several others before Kraft is Kraft's inequality — namely, that in a k-ary alphabet, a prefix code of code-lengths l1, . . . , lm exists if and only if Σ_{i=1}^{m} k^(−li) ≤ 1. The Huffman code construction algorithm, as carried out in our earlier examples (perhaps especially those of figs. 1 and 3), gives an informal intuitive argument as to why Kraft's inequality must be true.
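As a quick illustration (our own), one can check Kraft's inequality for the Huffman codes of figs. 1–3; each of these complete codes attains the bound with equality.

```python
# Kraft sum for a k-ary prefix code with lengths l_i: sum(k**-l_i) <= 1.
from fractions import Fraction

def kraft_sum(lengths, k):
    return sum(Fraction(1, k ** l) for l in lengths)

print(kraft_sum([1, 2, 4, 4, 3], k=2))              # fig. 1: exactly 1
print(kraft_sum([4, 4, 3, 2, 2, 2], k=2))           # fig. 2: exactly 1
print(kraft_sum([2, 2, 3, 3, 3, 1, 2, 2, 2], k=3))  # fig. 3: exactly 1
```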
2.3 Entropy
Let us re-visit our result from equation (1) and the standard accompanying approximation that li = − log pi . Let us begin with the 2-state case. Suppose we have probabilities p1 and p2 = 1 − p1 which we wish to encode with code-words of length l1 = − log q1 and l2 = − log q2 = − log(1 − q1 ) respectively. As per the Huffman code construction (and Kraft’s inequality), choosing such code lengths gives us a prefix code (when these code lengths are non-negative integers). The negative of the expected code length would then be p1 log q1 + (1 − p1 ) log(1 − q1 ), and we wish to choose q1 and q2 = 1 − q1 to make this code as short as possible on average — and so we differentiate the negative of the expected code length with respect to q1 .
    0 = d/dq1 [p1 log q1 + (1 − p1) log(1 − q1)]
      = (p1/q1) − ((1 − p1)/(1 − q1))
      = (p1(1 − q1) − q1(1 − p1))/(q1(1 − q1))
      = (p1 − q1)/(q1(1 − q1))
and so (p1 − q1) = 0, and so q1 = p1 and q2 = p2. This result also holds for p1, p2, p3, q1, q2 and q3 in the 3-state case, as we now show. Let P2 = p2/(p2 + p3), P3 = p3/(p2 + p3) = 1 − P2, Q2 = q2/(q2 + q3) and Q3 = q3/(q2 + q3) = 1 − Q2, so p2 = (1 − p1)P2, p3 = (1 − p1)P3 = (1 − p1)(1 − P2), q2 = (1 − q1)Q2 and q3 = (1 − q1)Q3 = (1 − q1)(1 − Q2). Encoding the events of probability p1, p2 and p3 with code lengths − log q1, − log q2 and − log q3 respectively, the negative of the expected code length is then p1 log q1 + (1 − p1)P2 log((1 − q1)Q2) + (1 − p1)(1 − P2) log((1 − q1)(1 − Q2)). To minimise, we differentiate with respect to both q1 and Q2 in turn, and set both of these to 0.

    0 = ∂/∂q1 [p1 log q1 + (1 − p1)P2 log((1 − q1)Q2) + (1 − p1)(1 − P2) log((1 − q1)(1 − Q2))]
      = (p1/q1) − ((1 − p1)P2)/(1 − q1) − ((1 − p1)(1 − P2))/(1 − q1)
      = (p1/q1) − (1 − p1)/(1 − q1)
      = (p1 − q1)/(q1(1 − q1)),
exactly as in the 2-state case above, where again q1 = p1.

    0 = ∂/∂Q2 [p1 log q1 + (1 − p1)P2 log((1 − q1)Q2) + (1 − p1)(1 − P2) log((1 − q1)(1 − Q2))]
      = ((1 − p1)P2)/Q2 − ((1 − p1)(1 − P2))/(1 − Q2)
      = (1 − p1) × ((P2/Q2) − ((1 − P2)/(1 − Q2)))
      = (1 − p1) × (P2 − Q2)/(Q2(1 − Q2))
In the event that p1 = 1, the result is trivial. With p1 ≠ 1, we have, of very similar mathematical form to the two cases just examined, 0 = (P2/Q2) − (1 − P2)/(1 − Q2), and so Q2 = P2, in turn giving that q2 = p2 and q3 = p3. One can proceed by the principle of mathematical induction to show that, for probabilities (p1, ..., pi, ..., pm−1, pm = 1 − Σ_{i=1}^{m−1} pi) and code-words of respective lengths (− log q1, ..., − log qi, ..., − log qm−1, − log qm = − log(1 − Σ_{i=1}^{m−1} qi)), the expected code length −(p1 log q1 + ... + pi log qi + ... + pm−1 log qm−1 + pm log qm) is minimised when qi = pi for all i. This expected (or average) code length,

    Σ_{i=1}^{m} pi × (− log pi) = − Σ_{i=1}^{m} pi log pi    (2)
is called the entropy of the m-state probability distribution (p1, ..., pi, ..., pm). Note that if we sample randomly from the distribution p with code-words of length − log p, then the (expected) average long-term cost is the entropy. Where the distribution is continuous rather than (as above) discrete, the sum is replaced by an integral and (letting x be the variable being integrated over) the entropy is then defined as

    ∫ f × (− log f) dx = − ∫ f log f dx = − ∫ f(x) log f(x) dx = ∫ f(x) × (− log f(x)) dx    (3)

And, of course, entropy can be defined for hybrid structures of both discrete and continuous variables, such as Bayesian network graphical models (of sec. 7.6) — see sec. 3.6, where it is pointed out that for the hybrid continuous and discrete Bayesian net graphical models in [Comley and Dowe, 2003; 2005] (emanating from the current author's ideas in [Dowe and Wallace, 1998]), the log-loss scoring approximation to Kullback-Leibler distance has been used [Comley and Dowe, 2003, sec. 9]. The next section, sec. 2.4, introduces Turing machines as an abstract model of computation and then discusses the formal relationship between MML and minimising the length of some (constrained) input to a Turing machine. The section can be skipped on first reading.
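Closing this section, a small numerical check (our own sketch) of the 2-state derivation above: a grid search confirms that the expected code length is minimised at q1 = p1, where it equals the entropy.

```python
# Expected code length -(p1*log(q1) + (1-p1)*log(1-q1)) is minimised at q1 = p1.
import math

p1 = 0.3
best_q1 = min((q / 1000 for q in range(1, 1000)),
              key=lambda q1: -(p1 * math.log(q1) + (1 - p1) * math.log(1 - q1)))
entropy = -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))
print(best_q1, round(entropy, 4))  # best_q1 == 0.3; the minimum is the entropy
```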
2.4 Turing machines and algorithmic information theory
The area of algorithmic information theory was developed independently in the 1960s by Solomonoff [1960; 1964], Kolmogorov [1965] and Chaitin [1966], independently of and slightly before the seminal Wallace & Boulton paper on MML [1968].
Despite the near-simultaneous independent work of the young Chaitin [1966] and the independent earlier work of Solomonoff [1960; 1964] pre-dating Kolmogorov, the area of algorithmic information theory is often referred to by many as Kolmogorov complexity (e.g., [Wallace and Dowe, 1999a; Li and Vitányi, 1997]). Before introducing the notion of the algorithmic complexity (or Kolmogorov complexity) of a string s, we must first introduce the notion of a Turing machine [Turing, 1936; Wallace, 2005, sec. 2.2.1; Dowe, 2008b, pp. 449-450]. Following [Dowe, 2008b, pp. 449-450], a Turing machine (TM) is an abstract mathematical model of a computer program. It can be written in a language from a certain alphabet of symbols (such as 1 and (blank) " ", also denoted by "⊔"). We assume that Turing machines have a read/write head on an infinitely long tape, finitely bounded to the left and infinitely long to the right. Turing machines have a set of instructions — or an instruction set — as follows. A Turing machine in a given state (with the read/write head) reading a certain symbol either moves to the left (L) or to the right (R) or stays where it is and writes a specified symbol. The instruction set for a Turing machine can be written as: f : States × Symbols → States × ({L, R} ∪ Symbols). So, the definition that we are using is that a Turing Machine M is a set of quadruples {Qn} = {⟨qi, qj, sk, {sl, H}⟩} where

• qi, qj ∈ {1, . . . , m} (the machine states)
• sk, sl ∈ {s0, . . . , sr} (the symbols)
• H ∈ {R, L} (tape head direction)

(such that no two quadruples have the same first and third elements). The Turing machine in state qi given input sk goes into state qj and either stays where it is and writes a symbol (sl) or moves to the left or right (depending upon the value of H) without writing a symbol. An alternative equivalent definition of a Turing Machine M which we could equally well use instead is a set of quintuples {Qn} = {⟨qi, qj, sk, sl, H⟩} where

• qi, qj ∈ {1, . . . , m} (the machine states)
• sk, sl ∈ {s0, . . . , sr} (the symbols)
• H ∈ {R, L} (tape head direction)

and the Turing machine in state qi given input sk then goes into state qj, writes symbol sl and moves the head in direction H (and, again, we require that no two quintuples have the same first and third elements). Note that the Turing Machine (TM) in the first definition either writes a (new) symbol or moves the head at each step whereas the TM in the second of these two equivalent definitions both writes a (new) symbol and moves the head.
Without loss of generality we can assume that the alphabet is the binary alphabet {0, 1}, whereupon the instruction set for a Turing machine can be written as: f : States × {0, 1} → States × ({L, R} ∪ {0, 1}). Any known computer program can be represented by a Turing Machine. Universal Turing Machines (UTMs) are like (computer program) compilers and can be made to emulate any Turing Machine (TM). An example of a Turing machine would be the program from fig. 4, which, given two inputs, x0 and x1, adds them together, writing x0 + x1 and then stopping¹. This machine adds two unary numbers (both at least 1), terminated by blanks (and separated by a single blank). In unary, e.g., 4 is represented by "1111⊔". In general in unary, n is represented by n 1s followed by a blank.
[Figure 4 state diagram — transitions: state 1 −(1, 1, R)→ state 2; state 2 −(1, 1, R)→ state 2; state 2 −(⊔, 1, R)→ state 3; state 3 −(1, 1, R)→ state 3; state 3 −(⊔, ⊔, L)→ state 4; state 4 −(1, ⊔, R)→ H]
Figure 4. A Turing machine program for adding two numbers

¹ Wherever he might or might not have inherited it from, I acknowledge obtaining the figure in fig. 4 from Kevin B. Korb.

Alternatively, recalling our notation of quintuples, ⟨qi, qj, sk, sl, H⟩, this Turing machine adding program from fig. 4 can be represented as: {⟨1, 2, 1, 1, R⟩, ⟨2, 2, 1, 1, R⟩, ⟨2, 3, ⊔, 1, R⟩, ⟨3, 3, 1, 1, R⟩, ⟨3, 4, ⊔, ⊔, L⟩, ⟨4, 5, 1, ⊔, R⟩} (where state 5 is the Halting — or stop — state, also referred to as H). (This Turing machine program over-writes the blank (⊔) in the middle with a 1 and removes a 1 from the right end of the second number — and, in so doing, leaves behind the unary representation of the sum.) Another example of a Turing machine would be a program which, for some a0 and a1, when given any input x, calculates (or outputs) a0 + a1x. In this case, x would be input in binary (base 2), and the output would be the binary representation of a0 + a1x. A Universal Turing machine (UTM) [Wallace, 2005, sec. 2.2.5] is a Turing machine which can simulate any other Turing machine. So, if U is a UTM and M is
a TM, then there is some input cM such that for any string s, U(cM s) = M(s) and the output from U when given the input cM s is identical to the output from M when given input s. In other words, given any TM M, there is an emulation program [or translation program] (or code) cM so that once U is input cM it forever after behaves as though it were M. The algorithmic complexity (or Kolmogorov complexity), KU(x), of a string x is the length of the shortest input (lx) to a Universal Turing Machine U such that, given input lx, U outputs x and then stops. (This is the approach of Kolmogorov [1965] and Chaitin [1966], referred to as stream one in [Wallace and Dowe, 1999a].) Algorithmic information theory can be used to give the algorithmic probability [Solomonoff, 1960; 1964; 1999; 2008] of a string (x) or alternatively also to insist upon the two-part MML form [Wallace and Dowe, 1999a; Wallace, 2005, secs. 2.2–2.3]. Let us elaborate, initially by recalling the notion of a prefix code (from sec. 2.2) and then by considering possible inputs to a UTM. Let us consider the two binary strings 0 and 1 of length 1, the four binary strings 00, 01, 10 and 11 of length 2, and (in general) the 2^n binary strings of length n. Clearly, if a Turing Machine stops on some particular input (of length n), then it will stop on that input with any suffix appended. The (unnormalised) probability that a UTM, U, will generate x from random input is PU(x) = Σ_{s : U(s)=x} 2^(−length(s)), summing over the strings s such that U taking input s will output x and then stop. In Solomonoff's original predictive specification [Solomonoff, 1960; 1964] (stream two from [Wallace and Dowe, 1999a]), the (unnormalised) summation actually includes more strings (and leads to a greater sum), including [Wallace, 2005, sec. 10.1.3] those strings s such that U on input s produces x and possibly a suffix — i.e., outputs a string for which x is a prefix. For this sum to be finite, we must add the stipulation that the strings s (over which we sum) must form a prefix code. In choosing the strings s to form a prefix code, the sum is not affected by insisting that the strings s are all chosen so that (e.g.) for no prefix s′ of s does U(s′) = x and then halt. And, for the sum Σ_x PU(x) to be useful, we must again make sure that the strings x are prefix-free — i.e., that the strings x all together form a prefix code — so as to avoid double-counting. Clearly, 2^(−KU(x)) < PU(x), since KU(x) corresponds only to the shortest input outputting x (the single largest term in the sum), whereas PU(x) sums over all the inputs which output x (whether or not we also wish to include terms which append a suffix to x). The earlier mention above of "(unnormalised)" is because, for many inputs, the UTM will not halt [Turing, 1936; Chaitin, 2005; Dowe, 2008a, footnote 70]. For the purposes of prediction, these considerations just discussed are sufficient. But, for the purposes of inference (or, equivalently, explanation or induction), we need a two-part construction — as per theorems from [Barron and Cover, 1991] and arguments from [Wallace and Freeman, 1987, p. 241; Wallace, 2005, sec. 3.4.5, pp. 190–191] (and some examples of what can go wrong when we don't have a two-part construction [Wallace and Dowe, 1999a, sec. 6.2; 1999b, secs. 1.2, 2.3 and 3; 1999c,
sec. 2]). Our two-part input to the (Universal) Turing Machine will be such that [Wallace and Dowe, 1999a; Wallace, 2005, secs. 2.3.6–2.3.9] the first part results in no output being written but rather the Turing machine is programmed with the hypothesis, H. Now programmed with the hypothesis, H, the Turing machine looks at the second part of the message (which is possibly the output of a Huffman code) and uses H to write out the data, D. The MML inference will be the hypothesis, H, which is represented by the first part of the shortest two-part input giving rise to the data, D. The other thing to mention here is the Bayesianism inherent in all these approaches. The Bayesian and (two-part [file compression]) information-theoretic interpretations of MML are both clearly Bayesian. And, although some authors have been known to neglect this (by sweeping Order 1, O(1), terms under the carpet or otherwise neglecting them), the choice of (Universal) Turing Machine in algorithmic information theory is (obviously?) also a Bayesian choice [Wallace and Dowe, 1999a, secs. 2.4 and 7; 1999c, secs. 1–2; Comley and Dowe, 2005, p. 269, sec. 11.3.2; Dowe, 2008a, footnotes 211, 225 and (start of) 133, and sec. 0.2.7, p. 546; 2008b, p. 450].
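As an illustration of the quintuple formalism, the adding machine of fig. 4 can be simulated directly. This is our own sketch (with the blank written as '_'); state 5 plays the role of the halting state H, and we assume, as the text does, that both unary inputs are at least 1.

```python
# Simulate the unary adding machine of fig. 4.
BLANK = "_"
# (state, symbol read) -> (new state, symbol written, head move)
PROGRAM = {
    (1, "1"): (2, "1", "R"),
    (2, "1"): (2, "1", "R"),
    (2, BLANK): (3, "1", "R"),   # over-write the separator blank with a 1
    (3, "1"): (3, "1", "R"),
    (3, BLANK): (4, BLANK, "L"),
    (4, "1"): (5, BLANK, "R"),   # erase one trailing 1, then halt
}

def run(tape_str):
    tape = list(tape_str)
    state, head = 1, 0
    while state != 5:
        if head >= len(tape):
            tape.append(BLANK)   # the tape extends infinitely to the right
        state, write, move = PROGRAM[(state, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
    return "".join(tape).strip(BLANK)

print(run("11_111"))  # '11111': 2 + 3 = 5 in unary
```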
2.5 Digression on Wallace non-universality probability
This section is a digression and can be safely skipped without any loss of continuity or context, but it does follow on from sec. 2.4 — which is why it is placed here. The Wallace non-universality probability [Dowe, 2008a, sec. 0.2.2, p. 530, col. 1 and footnote 70] of a UTM, U, is the probability that, given a particular infinitely long random bit string as input, U will become non-universal at some point. Quite clearly, the Wallace non-universality probability (WNUP) equals 1 for all non-universal TMs. Similarly, WNUP(U) is greater than 0 for all TMs, U; and WNUP equals 1 for some UTM if and only if it equals 1 for all UTMs. Wallace, others and I believed it to equal 1. In unpublished private communication, George Barmpalias argues that it isn't equal to 1, appealing to a result of Kucera. George is correct (and Chris and I mistaken) if and only if inf_{U : U a UTM} WNUP(U) = 0. This section was a digression and could be safely skipped without any loss of continuity or context.
3 PROBABILISTIC INFERENCE, LOG-LOSS SCORING AND KULLBACK-LEIBLER DISTANCE — AND UNIQUENESS
There are many measures of predictive accuracy. The simplest of these, such as on a quiz show, is the number of correct answers (or “right”/“wrong” scoring). There are likewise many measures of how close some estimated function is to the true function from which the data is really coming. We shall present the notions of probabilistic scoring and of measuring a distance between two functions.
From the notion of probabilistic scoring, we shall present our new apparent uniqueness property of log-loss scoring [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437–438]. From the notion of measuring a distance between two functions, we shall present a related result showing uniqueness (or two versions of uniqueness) of Kullback-Leibler distance [Dowe, 2008a, p. 438]. For those interested in causal decision theory and scoring rules, and for those simply interested in scoring rules and scoring probabilities, I highly recommend log-loss scoring and Kullback-Leibler distance — partly for their invariance and partly for their apparent uniqueness in having this invariance.
3.1 “Right”/“wrong” scoring and re-framing
Imagine two different quizzes which are identical apart from their similar but not quite identical beginnings. Quiz 1 begins with a multiple-choice question with 4 possible answers: 0, 1, 2, 3 or (equivalently, in binary) 00, 01, 10, 11. Quiz 2 begins with 2 multiple-choice questions:

• Q2.1: is the 2nd last bit a 0 or a 1?, and
• Q2.2: is the last bit a 0 or a 1?

Getting a score of 1 correct at the start of quiz 1 corresponds to getting a score of 2 at the start of quiz 2. Getting a score of 0 at the start of quiz 1 corresponds to a score of either 0 or 1 at the start of quiz 2. This seems unfair — so we might try to attribute the problem to the fact that quiz 1 began with 1 4-valued question where quiz 2 began with 2 2-valued questions, and explore whether all is fair when (e.g.) all quizzes have 2 2-valued questions. But consider now quiz 3 which, like quiz 2, begins with 2 2-valued questions, as follows:

• Q3.1: is the 2nd last bit a 0 or a 1?, and
• Q3.2: are the last two bits equal or not equal?

Getting Q3.2 correct means that on quiz 2 we either get 0 (if we get Q2.1 wrong) or 2 (if we get Q2.2 correct and therefore all questions correct). We see that no matter how we re-frame the question — whether as one big question or as lots of little questions — we get all answers correct on one quiz if and only if we get all answers correct on all quizzes. However, as the following example (of Quiz 4 and Quiz 5) demonstrates, we also see that even when all questions are binary (yes/no), it is possible to have two different framings of n questions such that in one such quiz (here, Quiz 4) we have (n − 1) questions answered correctly (and only 1 incorrectly) and in the re-framing to another quiz (here, Quiz 5) all n questions are answered incorrectly.
Quiz 4 (of n questions):

• Q4.1: What is the 1st of the n bits?
• Q4.i (i = 2, ..., n): Is the 1st bit equal to the ith bit?

Quiz 5 (of n questions):

• Q5.1: What is the 1st of the n bits?
• Q5.i (i = 2, ..., n): What is the ith bit?

If the correct bit string is 0^n = 0...0 and our guess is 1^n = 1...1, then on Quiz 4 we will get (n − 1) correct (and 1 wrong) whereas on quiz 5 we will get 0 correct (and all n wrong). This said by way of motivation, we now look at forms of prediction that remain invariant to re-framing — namely, probabilistic prediction with log(arithm)-loss — and we present some recent uniqueness results [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437-438] here.
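To see the asymmetry concretely, a tiny sketch (our own) scores both framings for the guess 1...1 against the truth 0...0:

```python
# "Right"/"wrong" totals differ wildly across the two framings.
n = 8
truth, guess = "0" * n, "1" * n

# Quiz 5: score each bit directly.
quiz5 = sum(t == g for t, g in zip(truth, guess))

# Quiz 4: score the first bit, then "is bit 1 equal to bit i?" for i = 2..n.
quiz4 = (truth[0] == guess[0]) + sum(
    (truth[0] == truth[i]) == (guess[0] == guess[i]) for i in range(1, n))

print(quiz4, quiz5)  # 7 and 0: n-1 correct versus none
```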
3.2 Scoring predictions, probabilistic predictions and log-loss
The most common form of prediction seems to be a prediction without a probability or anything else to quantify it. Nonetheless, in some forms of football, the television broadcaster sometimes gives an estimated probability of the kicker scoring a goal — based on factors such as distance, angle and past performance. And, of course, if it is possible to wager a bet on the outcome, then accurately estimating the probability (and comparing this with the potential pay-out if successful) will be of greater interest. Sometimes we don't care overly about a probability estimate. On some days, we might merely wish to know whether or not it is more probable that it will rain or that it won't. On such occasions, whether it's 52% probable or 97% probable that it will rain, we don't particularly care beyond noting that both these numbers are greater than 50% and we'll take our umbrella with us in either case. And sometimes we most certainly want a good and reliable probability estimate. For example, a patient reporting with chest pains doesn't want to be told "there's only a 40% chance that you're in serious danger (with a heart attack), so you can go now". And nor does an inhabitant of an area with the impending approach of a raging bush-fire want to be told "there's only a 45% chance of your dying or having serious debilitation if you stay during the fire, so you might as well stay". The notion of "reasonable doubt" in law is pertinent here — and, without wanting to seem frivolous, so, too, is the notion of when a cricket umpire should or shouldn't give the "benefit of the doubt" to the person batting (in l.b.w. or other contentious decisions). Now, it is well-known that with the logarithm-loss function (log p) for scoring probabilistic predictions, the optimal strategy is to give the true probability, if known.
This property also holds true for quadratic loss ((1 − p)^2) and has also been shown to be able to hold for certain other functions of probability [Deakin, 2001]. What we will show here is our new result that the logarithm-loss (or log-loss) function has an apparent uniqueness property on re-framing of questions [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437–438]. Let us now consider an example involving (correct) diagnosis of a patient. (With apologies to any and all medically-informed human earthlings of the approximate time of writing, the probabilities in the discussion(s) below might be from non-earthlings, non-humans and/or from a different time.) We'll give four possibilities:

1. no diabetes, no hypothyroidism
2. diabetes, but no hypothyroidism
3. no diabetes, but hypothyroidism
4. both diabetes and hypothyroidism.

Of course, rather than present this as one four-valued diagnosis question, we could have presented it in a variety of different ways. As a second possibility, we could have also asked, e.g., the following two two-valued questions:

1. no diabetes
2. diabetes

and

1. hypothyroidism
2. no hypothyroidism.

As another (third) alternative line, we could have begun with

1. no condition present
2. at least one condition present,

and then finished if there were no condition present but, if there were at least one condition present, instead then continued with the following 3-valued question:

1. diabetes, but no hypothyroidism
2. no diabetes, but hypothyroidism
3. both diabetes and hypothyroidism.
To give a correct diagnosis in the original setting requires answering one question correctly. In the second setting, it requires answering exactly two questions correctly — and in the third setting, it might require only answering one question correctly but it might require answering two questions correctly. Clearly, then, the number of questions answered correctly is not invariant to the re-framing of the question. However, the sum of logarithms of probabilities is invariant, and — apart from (quite trivially, multiplying it by a constant, or) adding a constant multiple of the entropy of the prior distribution — would appear to be unique in having this property. Let us give two examples of this. In the first example, our conditions will be independent of one another. In the second example, our conditions will be dependent upon one another. So, in the first case, with the conditions independent of one another, suppose the four estimated probabilities are

1. no diabetes, no hypothyroidism; probability 1/12
2. diabetes, but no hypothyroidism; probability 2/12 = 1/6
3. no diabetes, but hypothyroidism; probability 3/12 = 1/4
4. both diabetes and hypothyroidism; probability 6/12 = 1/2.

Then, in the second possible way we had of looking at it (with the two given two-valued questions), in the first case we have

1. no diabetes; probability 1/3
2. diabetes; probability 2/3

and — because the diseases are supposedly independent of one another in the example — we have

1. no hypothyroidism; probability 1/4
2. hypothyroidism; probability 3/4.

The only possible way we can have an additive score for both the diabetes question and the hypothyroid question separately is to use some multiple of the logarithms. This is because the probabilities are multiplying together and we want some score that adds across questions, so it must be (a multiple of) the logarithm of the probabilities. In the third alternative way that we had of looking at it, Pr(no condition present) = 1/12. If there is no condition present, then we do not need to ask the remaining question. But, in the event (of probability 11/12) that at least one condition is present, then we have

1. diabetes, but no hypothyroidism; probability 2/11
2. no diabetes, but hypothyroidism; probability 3/11
3. both diabetes and hypothyroidism; probability 6/11.

And the logarithm of probability score again works. So, our logarithm of probability score worked when the conditions were assumed to be independent of one another. We now consider an example in which they are not independent of one another. Suppose the four estimated probabilities are

1. no diabetes, no hypothyroidism; probability 0.1
2. diabetes, but no hypothyroidism; probability 0.2
3. no diabetes, but hypothyroidism; probability 0.3
4. both diabetes and hypothyroidism; probability 0.4.

Then, in the second possible way we had of looking at it (with the two given two-valued questions), for the first of the two two-valued questions we have

1. no diabetes; probability 0.4
2. diabetes; probability 0.6

and then for the second of which we have either

1. no hypothyroidism; prob(no hypothyroidism | no diabetes) = 0.1/(0.1 + 0.3) = 0.1/0.4 = 0.25
2. hypothyroidism; prob(hypothyroidism | no diabetes) = 0.3/(0.1 + 0.3) = 0.3/0.4 = 0.75

or

1. no hypothyroidism; prob(no hypothyroidism | diabetes) = 0.2/(0.2 + 0.4) = 0.2/0.6 = 1/3
2. hypothyroidism; prob(hypothyroidism | diabetes) = 0.4/(0.2 + 0.4) = 0.4/0.6 = 2/3.

And in the third alternative way of looking at this, prob(at least one condition present) = 0.9. If there is no condition present, then we do not need to ask the remaining question. But, in the event (of probability 9/10) that at least one condition is present, then we have the following three-way question:

1. diabetes, but no hypothyroidism; probability 2/9
2. no diabetes, but hypothyroidism; probability 3/9
3. both diabetes and hypothyroidism; probability 4/9.
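Before moving on, here is a quick numerical check (our own sketch) of that invariance for one outcome of the dependent example; the other outcomes and framings work out similarly.

```python
# Summed log-loss for the outcome "both diabetes and hypothyroidism"
# is identical under all three framings of the dependent example.
import math

# Framing 1: one four-valued question.
framing1 = -math.log(0.4)

# Framing 2: diabetes first, then hypothyroidism given diabetes.
framing2 = -math.log(0.6) - math.log(0.4 / 0.6)

# Framing 3: "at least one condition?", then the three-way question.
framing3 = -math.log(0.9) - math.log(4 / 9)

print(round(framing1, 6), round(framing2, 6), round(framing3, 6))  # all equal
```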
MML, Hybrid Bayesian Network Graphical Models, ...
919
We leave it to the reader to verify that the logarithm of probability score again works, again remaining invariant to the phrasing of the question. (Those who would rather see the above examples worked through with general probabilities rather than specific numbers are referred to a similar calculation in sec. 3.6.) Having presented in sec. 3.1 the problems with “right”/“wrong” scoring and having elaborated on the uniqueness under re-framing of log(arithm)-loss scoring [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437-438] above, we next mention at least four other matters. First, borrowing from the spirit of an example from [Wallace and Patrick, 1993], imagine we have a problem of inferring a binary (2-class) output and we have a binary choice (or a binary split in a decision tree) with the following output distributions. For the “no”/“left” branch we get 95 in class 1 and 5 in class 2 (i.e., 95:5), and for the “yes”/“right” branch we get 55 in class 1 and 45 in class 2 (i.e., 55:45). Because both the “no”/“left” branch and the “yes”/“right” branch give a majority in class 1, someone only interested in “right”/“wrong” score would fail to pick up on the importance and significance of making this split, simply saying that one should always predict class 1. Whether class 1 pertains to heart attack, dying in a bush-fire or something far more innocuous (such as getting wet in light rain), by reporting the probabilities we don’t run the risk of giving the wrong weights to (so-called) type I and type II errors (also known as false positives and false negatives). (Digressing, readers who might incidentally be interested in applying Minimum Message Length (MML) to hypothesis testing are referred to [Dowe, 2008a, sec. 0.2.5, p. 539 and sec. 0.2.2, p. 528, col. 1; 2008b, p. 433 (Abstract), p. 435, p. 445 and pp. 455–456; Musgrave and Dowe, 2010].) If we report a probability estimate of (e.g.) 45%, 10%, 5%, 1% or 0.1%, we leave it to someone else to determine the appropriate level of risk associated with a false positive in diagnosing heart attack, severe bush-fire danger or getting caught in light drizzle rain. And we now mention three further matters. First, some other uniqueness results of log(arithm)-loss scoring are given in [Milne, 1996; Huber, 2008]. Second, in binary (yes/no) multiple-choice questions, it is possible (and not improbable for a small number of questions) to serendipitously fluke a good “right”/“wrong” score, even if the probabilities are near 50%-50%, and with little or no risk of downside. But with log(arithm)-loss scoring, if the probabilities are near 50%-50% (or come from random noise and are 50%-50%), then predictions with more extreme probabilities are fraught with risk. And, third, given all these claims about the uniqueness of log(arithm)-loss scoring in being invariant to re-framing (as above) [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437–438] and having other desirable properties [Milne, 1996; Huber, 2008], we discuss in sec. 3.3 how a quiz show contestant asked to name (e.g.) a city or a person and expecting to be scored on “right”/“wrong” could instead give a probability distribution over cities or people’s names and be scored by
920
David L. Dowe
log(arithm)-loss.
3.3
Probabilistic scoring on quiz shows
Very roughly, one could have probabilistic scoring on quiz shows as follows. For a multiple-choice answer, things are fairly straightforward as above. But what if, e.g., the question asks for a name or a date (such as a year)? One could give a probability distribution over the length of the name. For example, the probability that the name has length l might be 1/2l = 2−l for l = 1, 2, .... Then, for each of the l places, there could be 28 possibilities (the 26 possibilities a, ..., z, and the 2 possibilities space “ ” and hyphen “-”) of probability 1/28 each. So, for example, as a default, “Wallace” would have a probability of 2−7 × (1/28)7 . Call this distribution Default. (Of course, we could refine Default by, e.g., noticing that the first character will neither be a space “ ” nor a hyphen “-” and/or by (also) noticing that both the space “ ” and the hyphen “-” are never followed by a space “ ” or a hyphen “-”.) For the user who has some idea rather than no proverbial idea about the answer, it is possible to construct hybrid distributions. So, if we wished to allocate probability 1/2 to “Gauss” and probability 1/4 to “Fisher” and otherwise we had no idea, then we could give a probability of 1/2 to “Gauss”, 1/4 to “Fisher” and for all other answers our probability we would roughly be 1/4 × Default. A similar comment applies to someone who thinks that (e.g.) there is a 0.5 probability that the letter “a” occurs at the start and a 0.8 probability that the letter “c” appears at least once or instead that (e.g.) a letter “q” not at the end can only be followed by an “a”, a “u”, a space “ ” or a hyphen “-”. If a question were to be asked about the date or year of such and such an event, one could give a ([truncated] Gaussian) distribution on the year in the same way that entrants having been giving (Gaussian) distributions on the margin of Australian football games since early 1996 as per sec. 3.5. It would be nice to see a (television) quiz (show) one day with this sort of scoring system. One could augment the above as follows. Before questions were asked to single contestants — and certainly before questions were thrown upon to multiple contestants and grabbed by the first one pressing the buzzer — it could be announced what sort of question it was (e.g., multiple-choice [with n options] or open-ended) and also what the bonus in bits was to be. The bonus in bits (as in the constant to be added to the logarithm of the contestant’s allocated probability to the correct answer) could relate to the ([deemed] prior) difficulty of the question, (perhaps) as per sec. 3.4 — where there is a discussion of adding a term corresponding to (a multiple of) the log(arithm) of the entropy of some Bayesian prior over the distribution of possible answers. And where quiz shows have double points in the final round, that same tradition could be continued. One comment perhaps worth adding here is that in typical quiz shows, as also in the probabilistic and Gaussian competitions running on the Australian Football League (AFL) football competitions from sec. 3.5, entrants are regularly updated
MML, Hybrid Bayesian Network Graphical Models, ...
921
on the scores of all their opponents. In the AFL football competitions from sec. 3.5, this happens at the end of every round. In quiz shows, this often also happens quite regularly, sometimes after every question. The comment perhaps worth adding is that, toward the end of a probabilistic competition, contestants trying to win who are not winning and who also know that they are not winning might well nominate “aggressive” choices of probabilities that do not represent their true beliefs and which they probably would not have chosen if they had not been aware that they weren’t winning. Of course, if (all) this seems improbable, then please note from the immediately following text and from sec. 3.5 that each year since the mid-1990s there have been up to hundreds of people (including up to dozens or more school students, some in primary school) giving not only weekly probabilistic predictions on results of football games but also weekly predictions of probability distributions on the margins of these games. And these have all been scored with log(arithm)-loss.
3.4
Entropy of prior and other comments on log-loss scoring
Before finishing with some references to papers in which we have used log(arithm)loss probabilistic (“bit cost”) scoring and mentioning a log-loss probabilistic competition we have been running on the Australian Football League (AFL) since 1995 in sec. 3.5, two other comments are worth making in this section. Our first comment is to return to the issue of quadratic loss ((1 − p)2 , which is a favourite of many people) and also the loss functions suggested more recently by Deakin [2001]. While it certainly appears that only log(arithm)-loss (and multiples thereof) retains invariance under re-framing, we note that − log(pn1 pn2 . . . pnm ) = −n
m X
log pi
(4)
i=1
So, although quadratic loss ((1 − p)2 ) does not retain invariance when adding scores between questions, if we were for some reason to want to multiply scores between questions (rather than add, as per usual), then the above relationship in equation (4) between a power score (quadratic score with n = 2) and log(arithmic) score — namely, that the log of a product is the sum of the logs — might possibly enable some power loss to be unique in being invariant under re-framing (upon multiplication). The other comment to make about the uniqueness of log(arithm)-loss (upon addition) under re-framing is that we can also add a term corresponding to (a multiple of) the entropy of the Bayesian prior [Tan and Dowe, 2006, sec. 4.2; Dowe, 2008a, footnote 176; 2008b, p. 438]. (As is hinted at in [Tan and Dowe, 2006, sec. 4.2, p. 600] and explained in [Dowe, 2008a, footnote 176], the idea for this arose in December 2002 from my correcting a serious mathematical flaw in [Hope and Korb, 2002]. Rather than more usefully use the fact that a logarithm of probability ratios is the difference of logarithms of probabilities, [Hope and Korb,
922
David L. Dowe
2002] instead suggests using a ratio of logarithms — and this results in a system where the optimal score will rarely be obtained by using the true probability.) Some of many papers using log(arithm)-loss scoring include [Good, 1952] (where it was introduced for the binomial distribution) and [Dowe and Krusel, 1993, p. 4, Table 3; Dowe et al., 1996d; Dowe et al., 1996a; Dowe et al., 1996b; Dowe et al., 1996c; Dowe et al., 1998, sec. 3; Needham and Dowe, 2001, Figs. 3–5; Tan and Dowe, 2002, sec. 4; Kornienko et al., 2002, Table 2; Comley and Dowe, 2003, sec. 9; Tan and Dowe, 2003, sec. 5.1; Comley and Dowe, 2005, sec. 11.4.2; Tan and Dowe, 2004, sec. 3.1; Kornienko et al., 2005a, Tables 2–3; Kornienko et al., 2005b; Tan and Dowe, 2006, secs. 4.2–4.3] (and possibly also [Tan et al., 2007, sec. 4.3]), [Dowe, 2007; 2008a, sec. 0.2.5, footnotes 170–176 and accompanying text; 2008b, pp. 437–438]. The 8-class multinomial distribution in [Dowe and Krusel, 1993, p. 4, Table 3] (from 1993) is the first case we are aware of in which log-loss scoring is used for a distribution which is not binomial, and [Dowe et al., 1996d; 1996a; 1996b; 1996c] (from 1996) are the first cases we are aware of in which log-loss scoring was used for the Normal (or Gaussian) distribution.
3.5
Probabilistic prediction competition(s) on Australian football
A log(arithm)-loss probabilistic prediction competition was begun on the outcome of Australian Football League (AFL) matches in 1995 [Dowe and Lentin, 1995; Dowe, 2008b, p. 48], just before Round 3 of the AFL season. In 1996, this was extended by the author to a log(arithm)-loss Gaussian competition on the margin of the game, in which competition entrants enter a µ and a σ — in order to give a predictive distribution N (µ, σ) on the margin — for each game [Dowe et al., 1996d; 1996a; 1996b; 1996c; 1998, sec. 3; Dowe, 2008a, sec. 0.2.5]. These log-loss compression-based competitions, with scores in bits (of information), have been running non-stop ever since their inceptions, having been put on the WWW in 1997 and at their current location of www.csse.monash.edu.au/~footy since 1998 [Dowe, 2008a, footnote 173]. (And thanks to many, especially Torsten Seemann for all the unsung behind the scenes support in keeping these competitions going [Dowe, 2008a, footnote 217].) The optimal long-term strategy in the log(arithm)loss probabilistic AFL prediction competition would be to use the true probability if one knew it. Looking ahead to sec. 3.6, the optimal long-term strategy in the Gaussian competition would be to choose µ and σ so as to minimise the KullbackLeibler distance from the (true) distribution on the margin to N (µ, σ 2 ). (In the log-loss probabilistic competition, the “true” probability is also the probability for which the Kullback-Leibler distance from the (true) distribution is minimised.) Competitions concerned with minimising sum of squared errors can still be regarded as compression competitions motivated by (expected) Kullback-Leibler distance minimisation, as they are equivalent to Gaussian competitions with σ fixed (where σ can be presumed known or unknown, as long as it’s fixed).
MML, Hybrid Bayesian Network Graphical Models, ...
3.6
923
Kullback-Leibler “distance” and measuring “distances” between two functions
Recall from sec. 2.3 that the optimal way of encoding some probability distribution, f, P is with code-words of length − log f , and that the average (or expected) cost m is i=1 fi × (− log fi ), also known as the entropy. The log(arithm)-loss score is obtained by sampling from some real-world data. If we are sampling from some true (known) distribution, then the optimal long-run average that we can expect will be the entropy. One way of thinking about the Kullback-Leibler divergence (or “distance”) between two distributions, f and g, is as the inefficiency (or sub-optimality) of encoding f using g rather than (the optimal) f . Equivalently, one can think about the Kullback-Leibler distance between two distributions, f and g, as the average (or expected) cost of sampling from distribution f and coding with the corresponding cost P of − log g minus the entropy of f — and, of course, the entropy of f (namely, − f log f ) is independent of g. Recalling equations (2) and (3) for the entropy of discrete and continuous models respectively, the Kullback-Leibler distance from f to g is m m X X f × (− log fi )) = f × (− log gi )) − ( ( i=1
i=1
m X i=1
f × log(fi /gi )
(5)
for the discrete distribution. As one might expect, for a continuous distribution, the Kullback-Leibler distance from f to g is Z Z ( f (x) × (− log g(x)) dx) − ( f (x) × (− log f (x)) dx) Z Z = f × log(f (x)/g(x)) dx = dx f × log(f /g) (6) We should mention here that many refer to the Kullback-Leibler “distance” as Kullback-Leibler divergence because it is not — in general — symmetrical. In other words, there are plenty of examples when KL(f, g), or, equivalently, ∆(g||f ), is not equal to KL(g, f ) = ∆(f ||g). In sec. 3.2, we showed a new result about uniqueness of log(arithm)-loss scoring in terms of being invariant under re-framing of the problem [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437–438]. It turns out that there is a similar uniqueness about Kullback-Leibler distance in terms of being invariant to re-framing of the problem [Dowe, 2008b, p. 438]. Despite some mathematics to follow which some readers might find slightly challenging in places, this follows intuitively because (the entropy of the true distribution, f , is independent of g and) the − log g term in the Kullback-Leibler distance is essentially the same as the log(arithm)-loss term in sec. 3.2. Before proceeding to this example, we first note that the Kullback-Leibler distance is quite clearly invariant under re-parameterisations such as (e.g.) transforming from polar co-ordinates (x, y) to Cartesian co-ordinates (r = sign(x) .
924
David L. Dowe
p
x2 + y 2 , θ = tan−1 (y/x)) and back from Cartesian co-ordinates (r, θ) to polar co-ordinates (x = r cos θ, y = r sin θ). We give an example of our new result (about the uniqueness of Kullback-Leibler distance in terms of being invariant to re-framing) below, letting f and g both have 4 states, with probabilities f1,1 , f1,2 , f2,1 , f2,2 , g1,1 , g1,2 , g2,1 and g2,2 respectively. The reader is invited to compare the example below with those from sec. 3.2. The reader who would prefer to see specific probabilities rather than these general probabilities is suggested more strongly to compare with sec. 3.2.
KL(f, g) = ∆(g||f ) =
2 2 X X i=1 j=1
=
2 2 X X
fi,j (log fi,j − log gi,j )
fi,j log(fi,j /gi,j )
(7)
i=1 j=1
Another way of looking at the Kullback-Leibler “distance” (or Kullback-Leibler divergence, or KL-distance) is to say that a proportion f1,· = f1,1 +f1,2 of the time, we have the Kullback-Leibler distance to the corresponding cross-section (g1,· ) of g, and then the remaining proportion 1 − f1,· = f2,· = f2,1 + f2,2 of the time, we have the Kullback-Leibler distance to the corresponding (other) cross-section (g2,· ) of g. To proceed down this path, we have to do the calculations at two levels. (Analogously with sec. 3.2, we could do the calculation in one step involving four terms or break it up into two levels of parts each involving two terms.) At the top level, we have to calculate the KL-divergence from the binomial distribution (f1,· , f2,· ) to the binomial distribution (g1,· , g2,· ). This top-level KL-divergence is f1,· log(f1,· /g1,· ) + f2,· log(f2,· /g2,· )
(8)
It then remains to go to the next level (or step) and first look at the binomial distribution f1,1 /(f1,1 + f1,2 ) and f1,2 /(f1,1 + f1,2 ) versus g1,1 /(g1,1 + g1,2 ) and g1,2 /(g1,1 + g1,2 ) (on the first or left branch), and then to look at the binomial distribution f2,1 /(f2,1 + f2,2 ) and f2,2 /(f2,1 + f2,2 ) versus g2,1 /(g2,1 + g2,2 ) and g2,2 /(g2,1 + g2,2 ) (on the second or right branch). Note that the first of these (KLdivergences or) coding inefficiencies will only occur a proportion f1,1 + f1,2 = f1,· of the time, and the second of these (KL-divergences or) coding inefficiencies will only occur a proportion f2,1 + f2,2 = f2,· = 1 − f1,· of the time. The first of these KL-divergences is the (expected or) average coding inefficiency when encoding (f1,1 /(f1,1 + f1,2 ), f1,2 /(f1,1 + f1,2 )) not using itself, but rather instead (sub-optimally) using (g1,1 /(g1,1 + g1,2 ), g1,2 /(g1,1 + g1,2 )). This first KLdivergence is f1,1 /(f1,1 + f1,2 ) log((f1,1 /(f1,1 + f1,2 ))/(g1,1 /(g1,1 + g1,2 ))) +
f1,2 /(f1,1 + f1,2 ) log((f1,2 /(f1,1 + f1,2 ))/(g1,2 /(g1,1 + g1,2 )))
MML, Hybrid Bayesian Network Graphical Models, ...
= + = +
925
(f1,1 /(f1,1 + f1,2 )) × ((log f1,1 /g1,1 ) − (log((f1,1 + f1,2 )/(g1,1 + g1,2 )))) (f1,2 /(f1,1 + f1,2 )) × ((log f1,2 /g1,2 ) − (log((f1,1 + f1,2 )/(g1,1 + g1,2 ))))
(f1,1 /f1,· ) × (log(f1,1 /g1,1 ) − log(f1,· /g1,· )) (f1,2 /f1,· ) × (log(f1,2 /g1,2 ) − log(f1,· /g1,· ))
(9)
Changing the very first subscript from a 1 to a 2, we then get that the second of these KL-divergences, namely the KL-divergence from the binomial distribution (f2,1 /(f2,1 + f2,2 )), f2,2 /(f2,1 + f2,2 )) to the binomial distribution (g2,1 /(g2,1 + g2,2 )), g2,2 /(g2,1 + g2,2 )), is f2,1 /(f2,1 + f2,2 ) log((f2,1 /(f2,1 + f2,2 ))/(g2,1 /(g2,1 + g2,2 ))) + = + = +
f2,2 /(f2,1 + f2,2 ) log((f2,2 /(f2,1 + f2,2 ))/(g2,2 /(g2,1 + g2,2 ))) (f2,1 /(f2,1 + f2,2 )) × ((log f2,1 /g2,1 ) − (log((f2,1 + f2,2 )/(g2,1 + g2,2 ))))
(f2,2 /(f2,1 + f2,2 )) × ((log f2,2 /g2,2 ) − (log((f2,1 + f2,2 )/(g2,1 + g2,2 )))) (f2,1 /f2,· ) × (log(f2,1 /g2,1 ) − log(f2,· /g2,· )) (f2,2 /f2,· ) × (log(f2,2 /g2,2 ) − log(f2,· /g2,· ))
(10)
The first coding inefficiency, or KL-divergence, given in equation (9), occurs a proportion f1,· = (f1,1 + f1,2 ) of the time. The second coding inefficiency, or KL-divergence, given in equation (10), occurs a proportion f2,· = (f2,1 + f2,2 ) = (1 − (f1,1 + f1,2 )) = 1 − f1,· of the time. So, the total expected (or average) coding inefficiency of using g when we should be using f , or equivalently the KL-divergence from f to g, is the following sum: the inefficiency from equation (8) + (f1,· × (the inefficiency from equation (9))) + (f2,· × (the inefficiency from equation (10))). Before writing out this sum, we note that
= +
(f1,· × (the inefficiency from equation (9))) f1,1 × (log(f1,1 /g1,1 ) − log(f1,· /g1,· )) f1,2 × (log(f1,2 /g1,2 ) − log(f1,· /g1,· ))
(11)
and similarly that
= +
(f2,· × (the inefficiency from equation (10))) f2,1 × (log(f2,1 /g2,1 ) − log(f2,· /g2,· )) f2,2 × (log(f2,2 /g2,2 ) − log(f2,· /g2,· ))
Now, writing out this sum, summing equations (8), (11) and (12), it is
+ +
f1,· log(f1,· /g1,· ) + f2,· log(f2,· /g2,· ) f1,1 × (log(f1,1 /g1,1 ) − log(f1,· /g1,· ))
f1,2 × (log(f1,2 /g1,2 ) − log(f1,· /g1,· ))
(12)
926
David L. Dowe
+ + =
=
f2,1 × (log(f2,1 /g2,1 ) − log(f2,· /g2,· )) f2,2 × (log(f2,2 /g2,2 ) − log(f2,· /g2,· ))
f1,1 log(f1,1 /g1,1 ) + f1,2 log(f1,2 /g1,2 ) +f2,1 log(f2,1 /g2,1 ) + f2,2 log(f2,2 /g2,2 ) 2 2 X X
fi,j log(fi,j /gi,j )
i=1 j=1
=
2 2 X X i=1 j=1
fi,j (log fi,j − log gi,j ) = KL(f, g) = ∆(g||f )
(13)
thus reducing to our very initial expression (7). Two special cases are worth noting. The first special case is when events are independent — and so fi,j = fi,· × φj = fi × φj and gi,j = gi,· × γj = gi × γj for some φ1 , φ2 , γ1 and γ2 (and f1 + f2 = 1, g1 + g2 = 1, φ1 + φ2 = 1 and γ1 + γ2 = 1). In this case, following from equation (7), we get KL(f, g)
=
∆(g||f ) =
2 2 X X
fi,j log(fi,j /gi,j ) =
2 2 X X
fi φj log(fi φj /(gi γj ))
i=1 j=1
i=1 j=1
=
2 2 X X
(fi φj log(fi /gi ) + fi φj log(φj /γj ))
i=1 j=1
=
2 2 X X φj log(φj /γj )) fi log(fi /gi )) + ( (
=
2 2 X X φj log(1/γj )) fi log(1/gi )) + ( ( j=1
i=1
2 2 X X
−( =
(14)
j=1
i=1
i=1 j=1
−fi φj log(fi φj ))
2 2 X X φj log(1/γj )) − (Entropy of f) fi log(1/gi )) + ( (
(15)
(16)
j=1
i=1
We observe in this case of the distributions being independent that the KullbackLeibler scores de-couple in exactly the same uniquely invariant way as they do for the probabilistic predictions in sec. 3.2. The second special case of particular note is when probabilities are correct in both branching paths — i.e., f1,1 /f1,· = g1,1 /g1,· , f1,2 /f1,· = g1,2 /g1,· , f2,1 /f2,· = g2,1 /g2,· and f2,2 /f2,· = g2,2 /g2,· . In this case, starting from equation (7), we get KL(f, g)
=
∆(g||f ) =
2 2 X X i=1 j=1
fi,j log(fi,j /gi,j )
MML, Hybrid Bayesian Network Graphical Models, ...
=
2 2 X X
927
fi,· (fi,j /fi,· ) log((fi,· (fi,j /fi,· ))/(gi,· (gi,j /gi,· )))
i=1 j=1
=
2 2 X X i=1 j=1
=
2 2 X X
fi,· (fi,j /fi,· ) log((fi,· /gi,· ) × [(fi,j /fi,· )/(gi,j /gi,· )]) fi,· (fi,j /fi,· ) log(fi,· /gi,· )
i=1 j=1
=
2 X i=1
fi,· log(fi,· /gi,· ) =
2 X i=1
fi,· log(1/gi,· ) − (Entropy of f) (17)
We observe in this second special case of probabilities being correct in both branching paths that the divergence between the two distributions is the same as that between the binomial distributions (f1,· , f2,· = 1 − f1,· ) and (g1,· , g2,· = 1 − g1,· ), exactly as it should and exactly as it would be for the uniquely invariant log(arithm)-loss (“bit cost”) scoring of the probabilistic predictions in sec. 3.2. Like the invariance of the log(arithm)-loss scoring of probabilistic predictions under re-framing (whose uniqueness is introduced in [Dowe, 2008a, footnote 175 (and 176)] and discussed in [Dowe, 2008b, pp. 437–438] and sec. 3.2), this invariance of the Kullback-Leibler divergence to the re-framing of the problem is [Dowe, 2008b, p. 438] due to the fact(s) that (e.g.) log(f /g) = log f − log g and − log(f1,1 /(f1,1 + f1,2 )) + log f1,1 = log(f1,1 + f1,2 ). A few further comments are warranted by way of alternative measures of “distance” or divergence between probability distributions. First, where one can define the notion of the distance remaining invariant under re-framing for these, the Bhattacharyya distance, Hellinger distance and Mahalanobis distance are all not invariant under re-framing. Versions of distance or divergence based on the R´enyi entropy give invariance in the trivial case that α = 0 (where the distance will always be 0) and the case that α = 1 (where we get the Shannon entropy and, in turn, the Kullback-Leibler distance that we are currently advocating). A second further comment is that, just as in [Tan and Dowe, 2006, sec. 4.2], we can also add a term corresponding to (a multiple of) the entropy of the Bayesian prior. Just like the Kullback-Leibler divergence (and any multiple of it), this (and any multiple of it) will also remain invariant under re-parameterisation or other re-framing [Dowe, 2008a, footnote 176; 2008b, p. 438]. A third — and important — further comment is that it is not just the KullbackLeibler distance from (say, the true distribution) f to (say, the inferred distribution) g, KL(f, g) = ∆(g||f ), that is invariant and appears to be uniquely invariant under re-framing, but clearly also KL(g, f ) = ∆(f ||g) is invariant, as will also be a sum of any linear combination of these, such as (e.g.) αKL(f, g) + (1−α)KL(g, f ) (with 0 ≤ α ≤ 1, although this restriction is not required for invariance) [Dowe, 2008b, p. 438]. The case of α = 1/2 gives the symmetric Kullback-Leibler distance. The notion of Kullback-Leibler distance can be extended quite trivially — as
928
David L. Dowe
above — to hybrid continuous and discrete Bayesian net graphical models [Tan and Dowe, 2006, sec. 4.2; Dowe 2008a, sec. 0.2.5; 2008b, p. 436] (also see sec. 7.6) or mixture models [Dowe, 2008b, p. 436], etc. For the hybrid continuous and discrete Bayesian net graphical models in [Comley and Dowe, 2003; 2005] (which resulted at least partly from theory advocated in [Dowe and Wallace, 1998]), the log-loss scoring approximation to Kullback-Leibler distance has been used [Comley and Dowe, 2003, sec. 9]. 4
OCKHAM’S RAZOR (AND MISUNDERSTANDINGS) AND MML
Let us recall Minimum Message Length (MML) from secs. 1 and 2.4, largely so that we can now compare and contrast MML with Ockham’s razor (also written as Occam’s razor). Ockham’s razor, as it is commonly interpreted, says that if two theories fit the data equally well then prefer the simplest (e.g., [Wallace, 1996b, sec. 3.2.2, p. 48, point b]). Re-phrasing this in statistical speak, if P r(D|H1 ) = P r(D|H2 ) and P r(H1 ) > P r(H2 ), then Ockham’s razor advocates that we prefer H1 over H2 — as would also MML, since P r(H1 )P r(D|H1 ) > P r(H2 )P r(D|H2 ). It is not clear what — if anything — Ockham’s razor says in the case that P r(H1 ) > P r(H2 ) but P r(D|H1 ) < P r(D|H2 ), although MML remains applicable in this case by comparing P r(H1 ) × P r(D|H1 ) with P r(H2 ) × P r(D|H2 ). In this sense, I would at least contend that MML can be thought of as a generalisation of Ockham’s razor — for MML tells us which inference to prefer regardless of the relationships of P r(H1 ) with P r(H2 ) and P r(D|H1 ) with P r(D|H2 ), but it is not completely clear what Ockham’s razor per se advocates unless we have that P r(D|H1 ) = P r(D|H2 ). Our earlier arguments (e.g., from sec. 1) tell us why the MML theory (or hypothesis) can be thought of as the most probable hypothesis. Informal arguments of Chris Wallace’s from [Dowe, 2008a, footnote 182] (in response to questions [Dowe and Hajek, 1997, sec. 5.1; 1998, sec. 5]) suggest that, if P r(D|H1 ) = P r(D|H2 ) and P r(H1 ) > P r(H2 ), then we expect H1 to be a better predictor than H2 . But, in addition, there is also an alternative, more general and somewhat informal argument for, in general, preferring the predictive power of one hypothesis over another if the former hypothesis leads to a shorter two-part message length. This argument is simply that the theory of shorter two-part message length contributes more greatly (i.e., has a greater Bayesian weighting) in the optimal Bayesian predictor. In particular, the MML model will essentially have the largest weight in the predictor. In those cases where the optimal Bayesian predictor is statistically consistent (i.e., converges to any underlying data when given sufficient data), the optimal Bayesian predictor and the MML hypothesis appear always to converge. Several papers have been written with dubious claims about the supposed ineffectiveness of MML and/or of Ockham’s razor. Papers using inefficient (Minimum Description Length [MDL] or MML) coding schemes lead quite understandably to sub-optimal results — but a crucial point about minimum message length is to make sure that one has a reliable message length (coding scheme) before one
MML, Hybrid Bayesian Network Graphical Models, ...
929
sets about seeking the minimum of this “message length”. For corrections to such dubious claims in such papers by using better coding schemes to give better results (and sometimes vastly better coding schemes to get vastly better results), see (e.g.) examples [Wallace and Dowe, 1999a, secs. 5.1 and 7; 1999c, sec. 2; Wallace, 2005, sec. 7.3; Viswanathan et al., 1999; Wallace and Patrick, 1993; Comley and Dowe, 2005, secs. 11.3 and 11.4.3; Needham and Dowe, 2001; Wallace, 2005, sec. 5.1.2; Gr¨ unwald, 2007, sec. 17.4, An Apologetic Remark; Dowe, 2008a, p. 536] such as those in [Dowe, 2008a, footnote 18]. Not unrelatedly, a variety of misconceptions have led a variety of authors to make ill-founded criticisms of Ockham’s razor. One (such) interpretation (I think I should say, misinterpretation) of Ockham’s razor seems to go along the lines that Ockham’s razor supposedly advocates the simplest hypothesis, regardless of any data — and so (e.g.) DNA should supposedly be shaped in a single-helix rather than a double-helix. [And it seems a pity to misinterpret Ockham’s razor so — especially in a biological framework — because interpreting Ockham’s razor more properly using MML enables us to make a strong case that proteins fold with the Helices (and Extendeds) forming first and then the “Other” turn classes forming subsequently to accommodate these structures [Edgoose et al., 1998, sec. 6; Dowe et al., 1996, sec. 5, p. 253] (see also [Dowe et al., 1995]) [Wallace, 1998a, sec. 4.2; Dowe, 2008a, footnote 85; 2008b, p. 454]. What seems like a variation of this misconception is an argument in one paper that if we fit data from within some model family (such as fitting the data with a decision tree) and then subsequently find that a more complicated model predicts better, then this is somehow supposedly empirical evidence against Ockham’s razor. (See also a comment here from [Jorgensen and Gentleman, 1998, Some Criticisms].) Using MML as our (more general) form of Ockham’s razor, these supposed criticisms based on using overly simple models and paying insufficient attention to the data seem somewhat silly. For a discussion of the adverse consequences of not giving equals weights to the lengths of the two parts of an MML message, see, e.g., [Dowe, 2008a, footnote 130]. For those who would like every function — both the simple functions and the more complicated functions — to have the same prior probability, not only does this seem counter-intuitive, but — furthermore — it is not possible when there are infinitely many theories. When there are infinitely many theories, it necessarily follows that, as we look at progressively more and more complicated theories, it must necessarily follow that the prior probability must tend asymptotically to zero so that the countable sum over the prior probabilities of the theories can equal 1 (or unity). Another criticism of the Bayesian approach — and therefore of our Bayesian MML interpretation of Ockham’s razor — is that this approach can be undone by a pathological (sabotaging) form of prior. If we look at Bayes’s theorem (from sec. 1) and its consequences in terms of MML, we see what our intuition tells us — that we get a reasonable posterior distribution over the hypotheses if we start off with a
930
David L. Dowe
reasonable prior distribution. Our Bayesian priors (as we use them in problems of inference) should be somewhere between what we genuinely suspect a priori and (partly politically, so that we are less open to being accused of fudging our results, and perhaps partly to protect ourselves from ourselves) something innocuous (and seemingly “objective”). For problems where the number of parameters is bounded above — or grows sufficiently slowly when the amount of data increases — Bayesian inference will converge to the underlying model given sufficient data. To criticise Bayesian MML and/or Ockham’s razor after being sabotaged by a counter-intuitive and misrepresentative pathological prior is somewhat akin to criticising any inference method when the bulk of the relevant explanatory variables are not made available (or at least not made available until after much data has been seen) but in their stead is a plethora of essentially irrelevant variables.
4.1
Inference (or explanation) and prediction
Inference — also variously known as explanation [Wallace, 2005, sec. 1.1, first sentence and sec. 1.5], induction and/or inductive inference — pertains to finding the single best explanation for a body of data. Prediction pertains to the activity of anticipating the future, whether this is done using a single inference or a combination of more than one inference. To give an example, someone doing inference would be interested in a model of stock market prices which gives a theory of how the stock market works. An investor would certainly find that useful, but an investor would perhaps be more interested in whether prices are expected to be going up or down (and a probability distribution over these events and the magnitude of movement). To give a second example [Dowe et al., 2007, sec. 6.1.4], when two models of slightly different (Bayesian posterior) probabilities give substantially different answers, inference would advocate going with the more probable theory where prediction would advocate doing some sort of averaging of the theories. In the classical (non-Bayesian) approach, inference and prediction are perhaps the same thing. Certainly, an inference can be used to predict — and, to the classically (non-Bayesian) minded, prediction seems to be done by applying the single best inference. But, to the Bayesian, the best predictor will often result from combining more than one theory [Wallace, 1996b, sec. 3.6, p. 55; Oliver and Hand, 1996; Wallace and Dowe, 1999a, sec. 8; Tan and Dowe, 2006]. Herein lies a difference between the predictive approach of Solomonoff and the MML inductive inference approach of Wallace from sec. 2.4. By taking the single best theory, MML is doing induction. Despite the potentially confusing use of the term “Solomonoff induction” by some others, Solomonoff (is not doing induction [and not really inductive inference per se, either] but rather) is doing prediction [Solomonoff, 1996; Wallace, 1996b, sec. 3.6, p. 55; 2005, sec 10.1]. On the relative merits of induction (or [inductive] inference) vs prediction, there can be no doubting that humans acknowledge and reward the intelligence behind
MML, Hybrid Bayesian Network Graphical Models, ...
931
inductive inferences. When we ask for a list of great human intellects, whoever else is on the list, there will be people who have made prominent inductive inferences. Examples of such people and theories include Isaac Newton for the theory of gravity, Charles Darwin and Alfred Russel Wallace for the theory of natural selection and evolution, Albert Einstein for the theories of special and general relativity, Alfred Wegener for the theory of “continental drift”, and countless Nobel laureates and/or others in a variety of areas for their theories [Sanghi and Dowe, 2003, sec. 5.2]. (And when a human is paid the compliment of being called “perceptive”, my understanding of this term is that one thing that is being asserted is that this “perceptive” person is good at making inductive inferences about human behaviour.) Of course, those such theories as are accepted and whose developers are rewarded usually are not just good pieces of induction but typically also lead to good predictions. And whether or not predictions are done with the single best available theory or with a combination of theories, people are certainly interested in having good predictors. In trying to re-construct or restore a damaged image, the argument in support of inference is that we clearly want the single best inference rather than a probability distribution over all possible re-constructions. On the other hand, if there are a few inferences almost as good as the best (MML) inference, we would also like to see these alternative models [Dowe, 2008a, sec. 0.3.1]. Let me now essentially repeat this example but modify its context. If you think that the sentence you are currently reading is written in grammatically correct unambiguous English (and that you correctly understand the author’s intended meaning), then you are using several little innocuous inferences — such as (e.g.) the author and you (the reader) have at least sufficiently similar notions of English-language word meaning, English-language grammar and English-language spelling. However, if the writing were smudged, the spelling was questionable and the grammar and punctuation were poor (several of which can happen with the abbreviated form of a telegram, some e-mails or a mobile text message), inference would advocate going with the single best interpretation. A related case in point is automatic (machine) translation. Whether for the smudged poorly written (ambiguous) sentence or for automatic translation, prediction (in its pure platonic form) would advocate having a probability distribution over all interpretations. In reality, if there is one outstandingly clear interpretation to the sentence, then someone doing prediction would most probably be satisfied with this interpretation, (as it were) “beyond a reasonable doubt”. But, as with the damaged image, if there are a few inferences almost as good as the best (MML) inference, we would again also like to see these alternative models. The distinction between explanation (or inference, or inductive inference, or induction) and prediction is something which at least some other authors are aware of [Wallace and Dowe, 1999a, sec. 8; Wallace 2005, sec. 10.1.2; Shmueli and Koppius, 2007, Dowe et al., 2007, secs. 6.1.4, 6.3 and 7.2; Dowe, 2008b, pp. 439–440], and we believe that both have their place [Dowe et al., 2007, sec. 6.3]. Whether or not because of our newly discussed uniqueness in invariance properties of Kullback-
932
David L. Dowe
Leibler distance (from [Dowe, 2008a, p. 438] and sec. 3.6), some authors regard prediction as being about minimising the expected log-likelihood error — or equivalently, minimising the expected Kullback-Leibler distance between the true model (if there is one) and the inferred model. While the reasons (of many) for doing this might be (more) about minimising the expected log-likelihood error, the uniqueness in invariance properties of Kullback-Leibler distance suggest it is certainly a worthy interpretation of the term “prediction” and that doing prediction this way is worthy of further investigation. Recalling the invariance of the Kullback-Leibler distance (from, e.g., sec. 3.6), taking the Bayesian approach to minimising the expected Kullback-Leibler distance will be invariant under re-parameterisation (e.g., from polar to Cartesian co-ordinates) [Dowe et al., 1998; Wallace, 2005, secs. 4.7–4.9; Dowe et al., 2007, secs. 4 and 6.1.4; Dowe, 2008a, sec. 0.2.2]. Recalling α at the very end of sec. 3.6 [from the expression αKL(f, g) + (1−α)KL(g, f ) = α∆(g||f ) + (1−α)∆(f ||g)], the extreme of α = 1 sees us choose a function (g) so that the expected coding inefficiency of using our function (g) rather than the (ideal) truth (true function, f ) is minimised, weighting over our posterior distribution on f ; and the other extreme of α = 0 sees us choose a function (g) so that (under the hypothetical assumption that the data were being sampled from distribution, g) the expected inefficiency of using a function (f ) sampled from the (actual) Bayesian posterior rather than using our function (g) is minimised. Although both of these are statistically invariant, convention is that we are more interested in choosing a function of minimal expected coding inefficiency relative to the (ideal) truth (true function) — equivalently minimising the expected log-likelihood error (and hence choosing α = 1). As a general rule of thumb, the MML estimator lies between the Maximum Likelihood estimator (which is given to over-fitting) on the one hand and the Bayesian minEKL estimator (which is, curiously, given to under-fitting) on the other hand [Wallace, 2005, secs. 4.7–4.9]. (Wallace makes an excellent intuitive case for this in [Wallace, 2005, sec. 4.9].) Four examples of this are the multinomial distribution, the Neyman-Scott problem (see sec. 6.4) [Wallace, 2005, sec. 4.2–4.8], the “gap or no gap” (“gappy”) problem [Dowe et al., 2007, sec. 6.2.4 and Appendix B] and the bus number problem [Dowe, 2008a, footnote 116; 2008b, p. 440]. We outline these below, and then mention at the end not yet totally explored possible fifth and sixth (which would probably begin from [Schmidt and Makalic, 2009b]) examples. For the multinomial distribution, with counts s1 , ..., sm , ..., sM in classes 1, ..., m, ..., M respectively and S = s1 + ... + sm + ... + sM , Maximum Likelihood gives pˆm = sm /S. With a uniform prior, the minEKL estimator (also known as the Laplace estimate or the posterior mean) is (sm +1)/(S +M ), whereas the WallaceFreeman MML approximation [Wallace and Freeman, 1987; Wallace, 2005, sec. 5.4] with this same prior is (ˆ pm )M M L = (sm + 1/2)/(S + M/2). For the particular case of the “gap or no gap” (“gappy”) problem [Dowe et al., 2007, sec. 6.2.4 and Appendix B], data ({xi : 0 ≤ xi ≤ 1, i = 1, ..., N } for
MML, Hybrid Bayesian Network Graphical Models, ...
933
increasing N ) are being generated uniformly either from the closed interval [0, 1] or from a sub-region [0, a] ∪ [b, 1] for some a and b such that a < b. We see Maximum Likelihood and Akaike’s Information Criterion (AIC) over-fitting here, surmising a gap even whether there isn’t one. At the other extreme, we see the minEKL estimator under-fitting, stating no gap even in extreme cases such as (e.g.) [0, 0.001] ∪ [0.999, 1] with a = 0.001 and b = 0.999. The curious behaviour of minEKL is due to the fact that the posterior probability that the region is [0, 1] will get arbitrarily small for large N but never down to 0, and there is an infinite penalty in Kullback-Leibler distance for ascribing a probability of 0 to something which can actually happen. Unlike the over-fitting Maximum Likelihood and AIC, and unlike the under-fitting minEKL, MML behaves fine in both cases [Dowe et al., 2007]. For the Neyman-Scott problem (of sec. 6.4), see [Wallace, 2005, sec. 4.2–4.8]. For the bus number problem [Dowe, 2008a, footnote 116; 2008b, p. 440] (where we arrive in a new town with θ buses numbered consecutively from 1 to θ, and we see only one bus and observe its number, xobs , and are then asked to estimate the number of buses in the town), the Maximum Likelihood estimate is the number of the observed bus, xobs , which is an absolute lower bound and seems like a silly under-estimate. At the other extreme, minEKL will behave in similar manner to how it did with the abovementioned “gappy” problem. It will choose the largest positive integer (no less than xobs ) for which the prior (and, in turn, the posterior) is non-zero. In the event that the prior never goes to 0, it will return infinity. It seems fairly trivial that the MML estimate must fall between the Maximum Likelihood estimate (the lowest possible value) and the minEKL estimate (from a Bayesian perspective, the highest possible estimate). For further discussion of the behaviour of MML here, see [Dowe, 2008a, footnote 116]. In addition to the four examples we have just given of the MML estimate lying between (over-fitting) Maximum Likelihood and (under-fitting) minEKL, of possible interest along these lines as potential fifth and sixth examples worthy of further exploration, see sec. 6.5 on panel data (as a probable fifth example) and the treatment of MML shrinkage estimation in [Schmidt and Makalic, 2009b]. 5 DESIDERATA: STATISTICAL INVARIANCE, STATISTICAL CONSISTENCY, EFFICIENCY, SMALL-SAMPLE PERFORMANCE, ETC. In this section, we look at several desiderata — or properties that we might desire — from statistical estimators.
5.1
Statistical invariance
Statistical invariance [Wallace, 2005, sec. 5.2; Dowe et al., 2007, sec. 5.3.2; Dowe, 2008b, p. 435] says, informally, that we get the same answer no matter how we phrase the problem.
934
David L. Dowe
So, if we know that the relationship between p the area A of a circle and its radius r is given by A = πr2 (and, equivalently, r = A/π), then statistical invariance requires that our estimate of the area is π times our the square of our estimate of the radius. The estimator function is often denoted by a hat (or circumflex),ˆ, above. So, for a circle, statistical invariance in the estimator would require that Aˆ = πˆ r2 . If we replace r by p κ in the Cartesian and polar co-ordinates example from sec. 3.6, (κ = sign(x) . x2 + y 2 , θ = tan−1 (y/x)) and (x = κ cos θ, y = κ sin θ). If we are estimating the strength and direction (κ, θ) of a magnetic field or equivalently the x and y co-ordinates (x, y) [Wallace and Dowe, 1993], then statistical invariance requires that x ˆ=κ ˆ cos θ, yˆ = κ ˆ sin θ. Statistical invariance is surely an aesthetic property of an estimate. In many problems, we are not committed to only one parameterisation - and, in those cases, statistical invariance is more useful than a simple aesthetic nicety. Maximum Likelihood, Akaike’s Information Criterion, Strict Minimum Message Length (SMML) [Wallace and Boulton, 1975; Wallace, 2005, chap. 3] and many MML approximations [Wallace and Freeman, 1987; Wallace, 2005, chaps. 4–5; Dowe 2008a, sec. 0.2.2 and footnote 159; Schmidt, 2008; Dowe, 2008b, p. 438 and p. 451] are statistically invariant, but there do exist approaches — such as the Bayesian Maximum A Posteriori (MAP) approach [Oliver and Baxter, 1994; Dowe et al., 1996e; Wallace and Dowe, 1999b, secs. 1.2–1.3; 1999c, sec. 2, col. 1; 2000, secs. 2 and 6.1; Comley and Dowe, 2005, sec. 11.3.1; Dowe et al., 2007, sec. 5.1, coding prior; Dowe 2008a, footnote 158; 2008b, p. 443 and pp. 448–449] — which are not statistically invariant.
5.2
Statistical consistency
For those who like collecting larger and larger data sets in the hope and belief that this will bring us closer and closer to whatever model or process underlies the data, statistical consistency — the notion of getting arbitrary close to any true underlying model given sufficient data [Dowe et al., 2007, secs. 5.3.4, 6.1.3 and later; Dowe 2008b, pp. 436–437] — is of paramount importance. More formally, we might write it as [Dowe, 2008b, p. 436] ˆ < ǫ) > 1 − ǫ ∀θ ∀ǫ > 0 ∃N0 ∀N ≥ N0 P r(|θ − θ| and we could even venture to write it (in a parameterisation-invariant way) as (e.g.) ˆ ˆ < ǫ) > 1 − ǫ. ∀θ ∀ǫ > 0 ∃N0 ∀N ≥ N0 P r(∆(θ||θ) = KL(θ, θ) Of course, as highlighted by Gr¨ unwald and Langford [2004; 2007], cases of model misspecification do occur. In other words, it might be that the true model θ (if there is one) is not contained in the family (or class) of models over which we ˆ In such cases, we can modify (or generalise) the notion conduct our search for θ.
MML, Hybrid Bayesian Network Graphical Models, ...
935
of statistical consistency to be that (as implicitly described in [Dowe, 2008a, sec. 0.2.5, p. 540, col. 1]), as we get more and more data, we get the Kullback-Leibler distance arbitrarily close to that of the closest available member in our model space. Or, more formally, ˆ ˆ < KL(θ, θˆbest ) + ǫ) > 1 − ǫ, ∀θ ∀ǫ > 0 ∃N0 ∀N ≥ N0 P r(∆(θ||θ) = KL(θ, θ) where θˆbest is as close as one can get in Kullback-Leibler distance to θ from within the space of models being considered. I should (and do now) qualify this slightly. It is possible that θˆbest might not exist in the same sense that there is no number in the list 1, 1/2, 1/3, 1/4, ... which is the list’s smallest element. One of a few ways of dealing with this is simply to replace the start of the above with “∀θ′ in our model space” and to replace the ˆ < KL(θ, θ′ ) + ǫ) > 1 − ǫ”, thus now making finish of the above with “P r(KL(θ, θ) it ∀θ′ in our model space
ˆ < KL(θ, θ′ ) + ǫ) > 1 − ǫ. ∀ǫ > 0 ∃N0 ∀N ≥ N0 P r(KL(θ, θ) (As a second point, replacing the first ǫ by ǫ/2 does not change the semantics of the definition. Similarly, replacing one of the ǫ terms by a δ and adding a quantifier ∀δ > 0 out front also does not change the semantics, as we can (e.g.) re-set ǫ′ = min{δ, ǫ}.) With an eye to secs. 6.4 and 6.5 and this issue of statistical consistency under misspecification, it is worth bearing in mind that — even though there is misspecification — the class (or family) of models over which we conduct our search might be dense in the space of possible models. In other words, if you have a non-negative valued function (or probability distribution) on the real line which integrates to 1 (and can’t be written as a finite mixture model), it can still be possible to find a sequence of finite (Gaussian) mixture models which fit it arbitrarily closely. (For reading on mixture models, see, e.g., [Jorgensen and McLachlan, 2008].)
5.3
Efficiency, small-sample performance, other considerations, etc.
The notion of efficiency is perhaps ambiguous in that it has been used in the literature with at least two different meanings. On the one hand, efficiency has been taken to mean that the message length calculations and approximations are both optimal or near-optimal (with li ≈ − log pi ) [Wallace, 2005, sec. 5.2.4]. On the other hand, efficiency of an estimator has been taken to mean the speed with which that estimator converges to the true model generating the data as the amount of data increases [Dowe et al., 2007, secs. 5.3.4 and 8]. While these two notions are different, it should be pointed out that, insofar as reliable MML coding schemes lead to good inferences and less reliable coding schemes lead to less reliable inferences [Quinlan and Rivest, 1989; Wallace and
936
David L. Dowe
Patrick, 1993, Kearns et al., 1997; Viswanathan et al., 1999; Wallace, 2005, sec. 7.3; Murphy and Pazzani, 1994; Needham and Dowe, 2001; Wallace and Dowe, 1999a, secs. 5.1 and 7; 1999c, sec. 2; Comley and Dowe, 2005, secs. 11.3 and 11.4.3; Wallace, 2005, sec. 5.1.2; Dowe, 2008a, footnote 18], the two notions are very related. As well as the notions of statistical invariance (from sec. 5.1), statistical consistency (from sec. 5.2) and efficiency, there are also issues of performing well on small-sample sizes [Dowe, 2008b, p. 436 and p. 456]. The issue of which likelihood function(s), sample size(s), parameterisation(s), Bayesian prior(s) and protocol(s) (or which parts of LNPPP-space) are important when comparing the efficacy of two estimators is discussed in [Dowe, 2008a, sec. 0.2.7, pp. 543–544]. 6
MINIMUM MESSAGE LENGTH (MML) AND STRICT MML
As in sec. 1, historically, the seminal Wallace and Boulton paper [1968] came into being from Wallace’s and Boulton’s finding that the Bayesian position that Wallace advocated and the information-theoretic (conciseness) position that Boulton advocated turned out to be equivalent [Wallace, 2005, preface, p. v; Dowe, 2008a, sec. 0.3, p. 546 and footnote 213]. After several more MML writings [Boulton and Wallace, 1969; 1970, p. 64, col. 1; Boulton, 1970; Boulton and Wallace, 1973b, sec. 1, col. 1; 1973c; 1975, sec. 1, col. 1] (and an application paper [Pilowsky et al., 1969], and at about the same time as David Boulton’s PhD thesis [Boulton, 1975]), their paper [Wallace and Boulton, 1975, sec. 3] again emphasises the equivalence of the probabilistic and information-theoretic approaches. (Different but not unrelated histories are given by Solomonoff [1997a] and a review of much later work by Kontoyiannis [2008]. For those interested in the formative thinking of Wallace (and Boulton) leading up to the seminal Wallace and Boulton MML paper [1968], see evidence of the young Bayesian (but pre-MML) Wallace in his mid-20s in the 1950s [Brennan, 2008, sec. 4; Brennan et al., 1958, Appendix] and see Wallace’s accounts of his early discussions with David Boulton [Wallace, 2005, preface, p. v; Dowe, 2008a, sec. 0.3, p. 546, col. 2 and footnote 213 (and sec. 1)] which resulted in [Wallace and Boulton, 1968]. If you can obtain it, then I also commend [Wallace, 1992] for background.) As in sec. 1 and following the principles of information theory from sec. 2.1, given data D, we wish to choose a hypothesis H so as to minimise the length of a two-part message conveying H (in part 1) followed (in part 2) by D given H. The length of this message is − log P r(H) − log P r(D|H). A one-part form of the message was examined in [Boulton and Wallace, 1969], but various pieces of theory and practice (e.g., [Barron and Cover, 1991]) point to the merits of the two-part form of the message. We now point to the Strict Minimum Message Length (SMML) formulation from Wallace and Boulton [1975] in sec. 6.1, and then go on to talk about some
MML, Hybrid Bayesian Network Graphical Models, ...
937
“MML” approximations to SMML, some conjectures about the possible uniqueness of Strict MML in being both statistically invariant and statistically consistent for certain classes of problems, and some applications of MML to a variety of problems in inference and other areas in science and philosophy.
6.1
Strict MML (SMML)
The Strict Minimum Message Length (SMML) formulation from Wallace and Boulton [Wallace and Boulton, 1975; Wallace and Freeman, 1987; Wallace, 1996c; Dowe et al., 1998; Wallace and Dowe, 1999a; 1999b; Farr and Wallace, 2002; Fitzgibbon et al., 2002b, Fitzgibbon, 2004; Agusta, 2005; Wallace, 2005, chap. 3; Comley and Dowe, 2005, sec. 11.2; Dowe et al., 2007; Dowe, 2008a, footnotes 12, 153, 158 and 196, and sec. 0.2.2] shows how to generate a code-book whose expected two-part message length is minimised, but this turns out to be computationally intractable except in the simplest of cases — such as the binomial distribution [Farr and Wallace, 2002; Wallace, 2005, chap. 3]. Of historical interest is the fact [Dowe, 2008a, sec. 0.1, p. 524, col. 1] that, even though MML had been in print many times over since 1968 [Wallace and Boulton, 1968, p. 185, sec. 2; Boulton and Wallace, 1969; 1970, p. 64, col. 1; Boulton, 1970; Boulton and Wallace, 1973b, sec. 1, col. 1; 1973c; 1975, sec. 1, col. 1; Boulton, 1975], referees delayed the publication of Strict MML until Wallace and Boulton [1975]. Strict MML (SMML) partitions in data-space and optimises a formula of the form
(−
X j
(qj log qj ))
+
(−
XX j
i∈cj
(qj .
r(xi ) . log f (xi |θj ))) qj
(18)
Note here first that i indexes over the data. This set must be countable, as all recorded measurements are truncated and recorded to some finite accuracy. (See [Dowe, 2008a, footnote 63] for a discussion of consequences of attempting to sidestep such an insistence.) This point established, we now assign data to groups indexed by j. The number of groups will certainly be countable (and to date I am not aware of any cases where there are infinitely many groups). R Letting h(·) be the prior and f (·|·) denote the statistical likelihood, r(xi ) = h(θ)f (xi |θ) dθ is the marginal probability i ) is a probability and not P of datum xi . Note that r(xP a density, and also that i r(xi ) = 1. The term qj = i∈cj r(xi ) is the amount of prior probability associated with the data group cj . The groups cj form a partition of the data, with P each datum being assigned to exactly one group — from which it follows that j qj = 1. For each data Pgroup cj , we choose the estimate θj which maximises the weighted log-likelihood i∈cj r(xi ) log f (xi |θj ). As we have written equation (18), the first term is the expected length of encoding the hypothesis (see, e.g., sec. 2.3) and the second term is the expected length
938
David L. Dowe
of encoding the data given this hypothesis — namely the hypothesis that datum xi lies in group cj with (prior) probability qj and estimate θj . The computational intractability of Strict MML (except in the simplest of cases — such as the binomial distribution [Farr and Wallace, 2002; Wallace, 2005, chap. 3]) is largely due to its discrete nature — or its being “gritty” (as Chris Wallace once put it) — requiring shuffling of data between data groups, then re-estimating the qj and θj for each data group cj , and then re-calculating the message length. The code-book with the shortest expected message length as per equation (18) is the SMML code-book, and the SMML estimator for each datum xi is the θj corresponding to the group cj to which xi is assigned.
6.2
Strict Strict MML (SSMML)
Recall the notion of (algorithmic information theory or) Kolmogorov complexity from sec. 2.4 and sec. 1. It could be said that the relationship between Strict MML and Kolmogorov complexity [Wallace and Dowe, 1999a; Wallace, 2005, chaps. 2–3] might be slightly enhanced if we turn the negative logarithms of the probabilities from equation (18) into integer code lengths — such as would seem to be required for constructing a Huffman code from sec. 2.1 (or other fully-fledged kosher code). From [Wallace, 2005, sec. 3.4, p. 191] and earlier writings (e.g., [Wallace and Freeman, 1987]), it is clear that Wallace was aware of this issue but chose to neglect and not be distracted by it. Although it is hard to imagine it having anything other than the most minor effect on results, we take the liberty here of introducing here what I shall call Strict Strict MML (SSMML), where the constituent parts of both the first part (currently, for each j, of length − log qj ) and the second part (currently, for each j, for each i, of length − log f (xi |θj )) of the message have non-negative integer lengths. One reason for preferring Strict MML to Strict Strict MML is that, as can be seen from inspecting equation (18), the Strict MML data groups, estimates and code-book will all be independent of the base of logarithm — be it 2, 10, e or whatever — and the (expected) message length will transform in the obvious invariant way with a change of base of logarithms. However, Strict Strict MML will require an integer greater than or equal to 2 to be the base of logarithms, and will not be independent of this choice of base. The simplest response to this objection is to insist that the base of logarithms is always 2 for Strict Strict MML. My guess is that Strict Strict MML (with base of logarithms set to 2, although any larger positive integer base should work both fine and similarly as the amount of data increases) will typically be very similar to Strict MML. By construction, Strict Strict MML (with fixed base of logarithms, such as 2) will necessarily be statistically invariant, and Strict Strict MML will presumably share statistical consistency and other desirable properties of Strict MML. There is another issue which arises when relating Strict MML (from sec. 6.1) and Strict Strict MML to Kolmogorov complexity (or algorithmic information
theory). As per equation (18) and associated discussion(s), both Strict MML and Strict Strict MML require the calculation of the marginal probability, r(x_i) = ∫ h(θ)f(x_i|θ) dθ, of each datum, x_i. As in sec. 6.1, these marginal probabilities are then used to calculate what we call the "coding prior" [Dowe et al., 2007, sec. 5.1], namely the discrete set of possible estimates {θ_j} and their associated prior probabilities, {q_j}, with Σ_j q_j = 1. (Strict MML is then equivalent to using the coding prior in combination with the given statistical likelihood function and doing conventional Bayesian Maximum A Posteriori (MAP).) As per [Wallace and Dowe, 1999a] and [Wallace, 2005, chaps. 2–3], the input to the Turing machine will be of a two-part form such that the first part of this input message (which conveys the hypothesis, theory or model) programs the Turing machine (without any output being written). The second part is then input to the program resulting from the first part of the message, and this input causes the desired data to be output. (In the example with noise in sec. 7.2, the first part of the message would encode the program together with an estimate of the noise, and the second part would encode the data with code lengths depending upon the probabilities as per sec. 2.1.) The particular additional issue which arises when relating Strict MML and Strict Strict MML to Kolmogorov complexity (or algorithmic information theory) occurs when dealing with universal Turing machines (UTMs) and the Halting problem (Entscheidungsproblem) — namely, we can get lower bounds on the marginal probability (r(x_i)) of the various data (x_i) but, due to the Halting problem, typically for at least many values of x_i we will not be able to calculate r(x_i) exactly but rather only give a lower bound. If the Turing machine (TM) representing our prior is not universal (e.g., if we restrict ourselves to the family of multivariate polynomials with one of the Bayesian priors typically used in such a case), then we can calculate r(x_i) to arbitrary precision for each x_i. But if the TM representing our prior is a UTM, then we might have to live with only having ever-improving lower bounds on each of the r(x_i). If we stop this process after some finite amount of time, then we should note that the coding prior corresponding to the grouping arising from Strict MML (and ditto from Strict Strict MML) would appear to have the potential to be different from the prior emanating from our original UTM. That said, if we don't go to the trouble of summing different terms contributed from different programs in the calculation of r(x_i) but rather simply take the largest available such term, then we quite possibly get something very similar or identical to our intuitive notion of a two-part Kolmogorov complexity. Finally, it is worth changing tack slightly and adding here that Strict MML is a function of the sufficient statistics in the data [Wallace, 2005, sec. 3.2.6], as also should be Strict Strict MML. When some authors talk of the Kolmogorov sufficient statistics, it is as though they sometimes forget or are unaware that sometimes — such as for the Student t distribution or the restricted cut-point segmentation problem from [Fitzgibbon et al., 2002b] — the minimal sufficient statistic can be the entire data set [Comley and Dowe, 2005, sec. 11.3.3, p. 270].
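Returning to the rounding that distinguishes Strict Strict MML from Strict MML, here is a minimal sketch (Python; the first-part probabilities q_j are made-up illustrative values, not the output of a real SMML code-book construction) of turning the real-valued lengths −log₂ q_j into the non-negative integer code lengths that a Huffman-style code requires, with a check that the Kraft inequality still holds:

    import math

    q = [0.5, 0.3, 0.2]                                # hypothetical code-book masses q_j

    ideal = [-math.log2(qj) for qj in q]               # real-valued SMML lengths (bits)
    integer = [math.ceil(length) for length in ideal]  # SSMML: whole numbers of bits

    # Rounding up preserves decodability: 2**-ceil(-log2 q) <= q, so Kraft holds.
    assert sum(2.0 ** -length for length in integer) <= 1.0

    for qj, l_real, l_int in zip(q, ideal, integer):
        print(f"q_j = {qj}: ideal {l_real:.3f} bits -> integer {l_int} bits")

Each part loses at most one bit to the rounding, consistent with the expectation above that Strict Strict MML will typically be very similar to Strict MML.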
6.3 Some MML approximations and some properties
Given the typical computational intractability of Strict MML from sec. 6.1 (which would only be worse for Strict Strict MML from sec. 6.2), it is customary to use approximations. Given data, D, the MMLD (or I_{1D}) approximation [Dowe, 2008a, sec. 0.2.2; Fitzgibbon et al., 2002a; Wallace, 2005, secs. 4.10 and 4.12.2 and chap. 8, p. 360; Dowe, 2008b, p. 451, eqn (4)] seeks a region R of the parameter space which minimises

\[ -\log\left(\int_R h(\vec{\theta})\, d\vec{\theta}\right) \;-\; \frac{\int_R h(\vec{\theta})\, \log f(D|\vec{\theta})\, d\vec{\theta}}{\int_R h(\vec{\theta})\, d\vec{\theta}} \tag{19} \]
The length of the first part is the negative log of the probability mass inside the region, R. The length of the second part is the (prior-weighted) average over the region R of the negative log-likelihood of the data, D. An earlier approximation, similar in motivation, which actually inspired Dowe's MMLD approximation of eqn (19) above, is the Wallace-Freeman approximation [Wallace and Dowe, 1999a, sec. 6.1.2; Wallace, 2005, chap. 5; Dowe, 2008b, p. 451, eqn (5)]
\[ -\log\left( h(\vec{\theta})\, \sqrt{\frac{1}{\kappa_d^{\,d}\, \mathrm{Fisher}(\vec{\theta})}} \right) \;-\; \log f(\vec{x}|\vec{\theta}) \;+\; \frac{d}{2} \;\;=\;\; -\log(h(\vec{\theta})) \;+\; L \;+\; \frac{1}{2}\log(\mathrm{Fisher}(\vec{\theta})) \;+\; \frac{d}{2}\,(1 + \log \kappa_d) \tag{20} \]
which was first published in the statistics literature [Wallace and Freeman, 1987]. Here L = −log f(x⃗|θ⃗) is the negative log-likelihood, d is the number of continuous parameters, Fisher(θ⃗) is the determinant of the matrix of expected second-order partial derivatives of L, and κ_d is a d-dimensional quantising-lattice constant (κ₁ = 1/12). (Digressing, note that if one approximates −log(1/Fisher(θ⃗)) in equation (20) very crudely as k log N, then equation (20) reduces to something essentially equivalent to Schwarz's Bayesian Information Criterion (BIC) [Schwarz, 1978] and Rissanen's original 1978 version of Minimum Description Length (MDL) [Rissanen, 1978], although it can be strongly argued [Wallace and Dowe, 1999a, sec. 7, p. 280, col. 2] that the −log(1/Fisher(θ⃗)) term from equation (20) is best not idly approximated away.) A very recent approximation certainly showing promise is due to Schmidt [2008] and is discussed in [Dowe, 2008a, footnotes 64–65]. This MMLFS estimator [Schmidt, 2008], upon close examination, would appear to be based on an idea in [Fitzgibbon et al., 2002a, sec. 2, especially equation (7)] (which in turn uses Wallace's FSMML Boundary rule as from [Wallace, 1998e]) and [Fitzgibbon et al., 2002b, sec. 4] (again using Wallace's FSMML Boundary rule as from [Wallace, 1998e] and [Fitzgibbon et al., 2002b, sec. 3.2], but see also [Wallace, 2005, sec. 4.11]). My MMLD estimator from equation (19) gives the message length (MsgLen) for a region. The MMLFS estimator just mentioned gives MsgLen for a point (as does the Wallace-Freeman [1987] estimator from equation (20)). Both MMLD and MMLFS are calculated using Markov Chain Monte Carlo (MCMC) methods. These approximations above, together with the Dowe-Wallace Ideal Group (or IG) estimator [Wallace, 2005, secs. 4.1, 4.3 and 4.9; Agusta, 2005, sec. 3.3.3,
pp. 60–62; Fitzgibbon, 2004, sec. 5.2, p. 70, footnote 1; Dowe, 2008a, p. 529, col. 1 and footnote 62] and other estimators (e.g., TAIG) discussed in [Dowe, 2008a, footnotes 62–65] are all statistically invariant. Recalling the notion of statistical consistency from sec. 5.2, we now show (in secs. 6.4 and 6.5) that MML is statistically consistent where a variety of other estimators fail — either under-estimating (which is typical of most alternatives to MML) or over-estimating (which is typical of the Bayesian minEKL estimator) the degree of noise.
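Before doing so, a concrete one-parameter instance of equation (20) may help. The following minimal sketch (Python; a binomial likelihood with a uniform prior is my illustrative choice, with d = 1 and κ₁ = 1/12) minimises the Wallace-Freeman message length numerically; the minimiser agrees with the closed form (s + 1/2)/(n + 1), which is the M = 2, α_i = 0 case of the multinomial formula given later in sec. 7.1.

    import numpy as np

    n, s = 20, 6                     # n Bernoulli trials, s successes observed
    kappa_1 = 1.0 / 12.0             # 1-dimensional quantising-lattice constant

    def wf_msg_len(th):
        # Equation (20) with d = 1: -log h + L + (1/2) log Fisher + (d/2)(1 + log kappa_d).
        L = -(s * np.log(th) + (n - s) * np.log(1.0 - th))   # negative log-likelihood
        fisher = n / (th * (1.0 - th))                       # expected Fisher information
        return 0.0 + L + 0.5 * np.log(fisher) + 0.5 * (1.0 + np.log(kappa_1))  # -log h = 0

    grid = np.linspace(1e-4, 1.0 - 1e-4, 200001)
    th_hat = grid[np.argmin(wf_msg_len(grid))]

    print("numerical Wallace-Freeman estimate:", th_hat)     # about 0.3095
    print("closed form (s + 1/2)/(n + 1):     ", (s + 0.5) / (n + 1))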
6.4 Neyman-Scott problem and statistical consistency
In the Neyman-Scott problem [Neyman and Scott, 1948; Dowe, 2008b, p. 453], we measure N people's heights J times each (say J = 2) and then infer

1. the heights μ_1, ..., μ_N of each of the N people,

2. the accuracy (σ) of the measuring instrument.

We have JN measurements from which we need to estimate N + 1 parameters. JN/(N + 1) ≤ J, so the amount of data per parameter is bounded above (by J). It turns out that σ̂²_MaximumLikelihood → ((J−1)/J) σ², and so for fixed J as N → ∞ we have that Maximum Likelihood is statistically inconsistent — under-estimating σ [Neyman and Scott, 1948] and "finding" patterns that aren't there. The Bayesian Maximum A Posteriori (MAP) approach (from sec. 5.1) is likewise not statistically consistent here [Dowe, 2008a, footnote 158]. Curiously, the Bayesian minimum expected Kullback-Leibler distance (minEKL) estimator [Dowe et al., 1998; Wallace, 2005, secs. 4.7–4.9; Dowe et al., 2007, secs. 4 and 6.1.4; Dowe, 2008a, sec. 0.2.2; 2008b, p. 444] from sec. 4.1 is also statistically inconsistent for the Neyman-Scott problem, conservatively over-estimating σ [Wallace, 2005, sec. 4.8]. Recall a discussion of this (the Neyman-Scott problem) and of the "gappy" ("gap or no gap") problem in sec. 4.1. However, the Wallace-Freeman MML estimator from equation (20) and the Dowe-Wallace Ideal Group (IG) estimator have both been shown to be statistically consistent for the Neyman-Scott problem [Dowe and Wallace, 1996; 1997a; 1997b; Wallace, 2005, secs. 4.2–4.5 and 4.8; Dowe et al., 2007, secs. 6.1.3–6.1.4; Dowe, 2008a, secs. 0.2.3 and 0.2.5; 2008b, p. 453]. An interesting discussion of the intuition behind these results is given in [Wallace, 2005, sec. 4.9]. We now use MML to re-visit a Neyman-Scott(-like) panel data problem from [Lancaster, 2002], as hinted at in [Dowe, 2008a, sec. 0.2.3, footnote 88].
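Before turning to that panel-data problem, the inconsistency of Maximum Likelihood just described is easy to exhibit numerically. A minimal simulation sketch (Python; the population mean and spread of the heights are illustrative values of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    N, J, sigma = 100000, 2, 1.0                       # many people, J = 2 measurements each

    mu = rng.normal(170.0, 10.0, size=N)               # true heights mu_1, ..., mu_N
    x = mu[:, None] + rng.normal(0.0, sigma, size=(N, J))

    mu_hat = x.mean(axis=1)                            # ML estimate of each mu_i
    sigma2_hat = ((x - mu_hat[:, None]) ** 2).mean()   # ML estimate of sigma^2

    print("ML estimate of sigma^2:", sigma2_hat)       # clusters near 0.5, not 1.0
    print("(J - 1)/J * sigma^2:   ", (J - 1) / J * sigma ** 2)

No matter how large N becomes, σ̂² stays near ((J−1)/J)σ² = σ²/2: extra people bring extra parameters as fast as they bring extra data.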
6.5 Neyman-Scott panel data problem (from Lancaster)
Following the concise discussion in [Dowe, 2008a, sec. 0.2.3, footnote 88], we use MML here to re-visit the panel data problem from [Lancaster, 2002, 2.2 Example 2, pp. 651–652].
\[ y_{i,t} = f_i + x_{i,t}\,\beta + u_{i,t} \qquad (i = 1, \ldots, N;\; t = 1, \ldots, T) \tag{21} \]
where the u_{i,t} are independently Normal(0, σ²) conditional on the regressor sequence, the parameters being the f_i and θ = (β, σ²). We can write the (negative) log-likelihood as

\[ L \;=\; \frac{NT}{2}\log 2\pi \;+\; \frac{NT}{2}\log \sigma^2 \;+\; \frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta)^2 \tag{22} \]
Using the Wallace-Freeman approximation [Wallace and Dowe, 1999a, sec. 6.1.2; Wallace, 2005, chap. 5; Dowe, 2008b, p. 451, eqn (5)] from equation (20), we require a Bayesian prior (which does not have a great effect but which does, among other things, keep the estimator statistically invariant) and the determinant of the expected Fisher information matrix of expected second-order partial derivatives (with respect to the f_i [i = 1, ..., N], β and σ²). Before taking the expectations, let us first take the second-order partial derivatives — starting with the diagonal terms.

\[ \frac{\partial L}{\partial f_i} = -\frac{1}{\sigma^2}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta), \qquad\text{and}\qquad \frac{\partial^2 L}{\partial f_i^2} = T/\sigma^2 \tag{23} \]
Not dissimilarly,

\[ \frac{\partial L}{\partial \beta} = -\frac{1}{\sigma^2}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{i,t}\,(y_{i,t} - f_i - x_{i,t}\beta), \qquad\text{and}\qquad \frac{\partial^2 L}{\partial \beta^2} = \frac{1}{\sigma^2}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{i,t}^2 \tag{24} \]

\[ \frac{\partial L}{\partial (\sigma^2)} = \frac{NT}{2\sigma^2} - \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta)^2, \tag{25} \]

and

\[ \frac{\partial^2 L}{\partial (\sigma^2)^2} = -\frac{NT}{2(\sigma^2)^2} + \frac{1}{(\sigma^2)^3}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta)^2 \tag{26} \]
Still looking at the second derivatives, let us now look at the off-diagonal terms and then return to take expectations.

\[ \frac{\partial^2 L}{\partial f_i \partial f_j} = \frac{\partial^2 L}{\partial f_j \partial f_i} = \frac{\partial}{\partial f_j}\left( -\frac{1}{\sigma^2}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta) \right) = 0 \qquad (\text{for } i \neq j) \tag{27} \]
\[ \frac{\partial^2 L}{\partial f_i \partial \beta} = \frac{\partial^2 L}{\partial \beta \partial f_i} = \frac{1}{\sigma^2}\sum_{t=1}^{T} x_{i,t} \tag{28} \]

\[ \frac{\partial^2 L}{\partial f_i \partial (\sigma^2)} = \frac{\partial^2 L}{\partial (\sigma^2) \partial f_i} = \frac{1}{(\sigma^2)^2}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta) \tag{29} \]

\[ \frac{\partial^2 L}{\partial \beta \partial (\sigma^2)} = \frac{\partial^2 L}{\partial (\sigma^2) \partial \beta} = \frac{1}{(\sigma^2)^2}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{i,t}\,(y_{i,t} - f_i - x_{i,t}\beta) \tag{30} \]
Now taking expectations to get the terms contributing to the determinant of the expected Fisher information matrix, namely the expected Fisher information, let us first use equation (27) (dealing with the off-diagonal cases i ≠ j) and equation (23) (dealing with the diagonal cases i = j) to give

\[ E\!\left(\prod_{i=1}^{N} \frac{\partial^2 L}{\partial f_i^2}\right) = \prod_{i=1}^{N} E\!\left(\frac{\partial^2 L}{\partial f_i^2}\right) = \prod_{i=1}^{N} T/\sigma^2 = T^N/(\sigma^2)^N \tag{31} \]
Re-visiting equation (29), we have that

\[ E\!\left(\frac{\partial^2 L}{\partial f_i \partial (\sigma^2)}\right) = 0 \qquad (\text{for } i = 1, \ldots, N) \tag{32} \]
Equation (28) gives us a term proportional to 1/σ², namely

\[ E\!\left(\frac{\partial^2 L}{\partial f_i \partial \beta}\right) = \frac{1}{\sigma^2}\sum_{t=1}^{T} E(x_{i,t}), \tag{33} \]
and equation (26) gives us

\[ E\!\left(\frac{\partial^2 L}{\partial (\sigma^2)^2}\right) = -\frac{NT}{2(\sigma^2)^2} + \frac{1}{(\sigma^2)^3}\, NT\, E\big((y_{i,t} - f_i - x_{i,t}\beta)^2\big) = -\frac{NT}{2(\sigma^2)^2} + \frac{NT\,\sigma^2}{(\sigma^2)^3} = \frac{NT}{2(\sigma^2)^2} \tag{34} \]
From [Lancaster, 2002, p. 651, 2.2 Example 2, equation (2.8)] and our equation (30), we have

\[ E\!\left(\frac{\partial^2 L}{\partial \beta \partial (\sigma^2)}\right) = 0 \tag{35} \]
Looking at the (N + 2) × (N + 2) expected Fisher information matrix, we first note that the only non-zero entry in the σ² column is also the only non-zero entry in the σ² row, namely that from equation (34). Looking at the rest of the matrix, namely the (N + 1) × (N + 1) sub-matrix in the top left, we see that the only non-zero off-diagonal terms are the E(∂²L/(∂f_i ∂β)) terms from equation (33) in the row and column corresponding to β. Looking at equations (31) and (33), we see that these few off-diagonal terms from equation (33) and all the diagonal terms are of the form Const₁/σ². Combining this with equation (34), we see that the Fisher information is given by Const₂ × (1/(σ²)^{N+1}) × (NT)/(2(σ²)²) = Const/((σ²)^{N+3}). Anticipating what we need for the Wallace-Freeman (1987) MML approximation in equation (20), this expression for Fisher(θ⃗) and equation (22) then give that

\[ L + \tfrac{1}{2}\log \mathrm{Fisher}(\vec{\theta}) \;=\; \frac{NT}{2}\log 2\pi + \frac{NT}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta)^2 + \left( \tfrac{1}{2}\log(\mathrm{Const}) - \frac{N+3}{2}\log \sigma^2 \right) \]
\[ = \frac{1}{2}\log\big((2\pi)^{NT}\,\mathrm{Const}\big) + \frac{N(T-1) - 3}{2}\,\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{t=1}^{T}(y_{i,t} - f_i - x_{i,t}\beta)^2 \tag{36} \]
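The algebra above can be checked mechanically. The following sketch (Python with sympy; the small symbolic instance N = 2, T = 3 is an illustrative choice of mine) reproduces equations (23), (24), (27) and (28), and confirms that substituting E[(y_{i,t} − f_i − x_{i,t}β)²] = σ² into equation (26) yields NT/(2(σ²)²) as in equation (34).

    import sympy as sp

    N, T = 2, 3                                    # a small symbolic instance
    f = sp.symbols(f'f1:{N + 1}')                  # fixed effects f_1, ..., f_N
    beta = sp.Symbol('beta')
    s = sp.Symbol('s', positive=True)              # s stands for sigma^2

    y = sp.symbols(f'y1:{N * T + 1}')
    x = sp.symbols(f'x1:{N * T + 1}')
    resid = [y[i * T + t] - f[i] - x[i * T + t] * beta
             for i in range(N) for t in range(T)]

    L = (sp.Rational(N * T, 2) * sp.log(2 * sp.pi)
         + sp.Rational(N * T, 2) * sp.log(s)
         + sum(r ** 2 for r in resid) / (2 * s))   # equation (22)

    assert sp.simplify(sp.diff(L, f[0], 2) - T / s) == 0                      # eqn (23)
    assert sp.simplify(sp.diff(L, f[0], f[1])) == 0                           # eqn (27)
    assert sp.simplify(sp.diff(L, beta, 2) - sum(xi ** 2 for xi in x)/s) == 0 # eqn (24)
    assert sp.simplify(sp.diff(L, f[0], beta)
                       - sum(x[t] for t in range(T)) / s) == 0                # eqn (28)

    # Expectation step for equation (34): E[(y - f - x*beta)^2] = sigma^2.
    d2s = sp.diff(L, s, 2)                         # equation (26)
    for r in resid:
        d2s = d2s.subs(r ** 2, s)
    assert sp.simplify(d2s - sp.Rational(N * T, 2) / s ** 2) == 0             # eqn (34)
    print("all derivative checks passed")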
Re-capping, leaving aside the Bayesian priors (which will give statistical invariance to the MML estimator) and some constant terms, we see that the Wallace-Freeman MML approximation gives us what [Lancaster, 2002, p. 652] calls "the 'correct' degrees of freedom, N(T − 1), apart from a negligible term". As per [Dowe, 2008a, sec. 0.2.3, footnote 88], this can be extended to also deal with the subsequent panel data problem from [Lancaster, 2002, 2.3 Example 2, pp. 652–653]. Having made these points about statistical invariance and statistical consistency of MML, we perhaps digress slightly and note that Grünwald and Langford [2004; 2007] have shown statistical inconsistency under model misspecification for various Bayesian estimators and various forms of the Minimum Description Length (MDL) principle, but we are not aware of any current evidence for a statistical inconsistency in MML [Grünwald and Langford, 2007, sec. 7.1.5; Dowe, 2008a, sec. 0.2.5]. The above — and other — evidence and experience has led to the following conjectures. The first two conjectures deal with the case that there is a true model in the space of models being examined, and the subsequent conjectures deal with the case of model misspecification.
Conjecture 1 [Dowe et al., 1998, p. 93; Edwards and Dowe, 1998, sec. 5.3; Wallace and Dowe, 1999a, p. 282; 2000, sec. 5; Comley and Dowe, 2005, sec. 11.3.1, p. 269; Dowe, 2008a, sec. 0.2.5, pp. 539–540; 2008b, p. 454]: Only MML and very closely-related Bayesian methods are in general both statistically consistent and invariant. (This first conjecture was once the subject of a panel discussion at a statistics conference [Dowe et al., 1998a].)

Conjecture 2 (Back-up Conjecture) [Dowe et al., 2007, sec. 8; Dowe, 2008a, sec. 0.2.5; 2008b, p. 454]: If there are (hypothetically) any such non-Bayesian methods, they will be far less efficient than MML.

Re the issue of statistical consistency under model misspecification as per sec. 5.2, first, suppose that the space where we conduct our search is dense in the space from which the true model comes (e.g., suppose the true model space is that of friendly non-Gaussian t distributions and our search space is the space of finite mixtures of Gaussian distributions). Then, in this case, if our inference method is statistically consistent when the true model comes from the search space (i.e., in this example, if our inference method is statistically consistent within the space of finite Gaussian mixture models) then we would expect our inference method to still be statistically consistent for the misspecified true model from the larger class (i.e., in this example, we would expect our inference method to remain statistically consistent when the true model is a friendly non-Gaussian t distribution from ["just"] outside our search space). Paraphrasing, if it's consistent in the search space, it can get arbitrarily close within the search space — and, if the search space is dense in the true space, then it would appear that we can get arbitrarily close to something arbitrarily close, seemingly implying statistical consistency. Still on this issue of statistical consistency under model misspecification from sec. 5.2, we know that MML will be statistically invariant and we further conjecture [Dowe, 2008a, sec. 0.2.5, especially p. 540] that MML will still — in this more challenging setting — be statistically consistent. If there are (hypothetically) any non-Bayesian methods which are statistically consistent in this setting, then we further conjecture that they will be less efficient than MML. Note throughout that the statistical consistency is coming from the information-theoretic properties of MML and the statistical invariance is coming from the Bayesian priors. If there is any truth to these conjectures — and I am yet to see anything I could legitimately call a counter-example — then it would seem to suggest that inference done properly must inherently be Bayesian. I make this claim because

• recalling sec. 5.1, statistical invariance says that we get the same answer whether we model in (e.g.) polar or Cartesian co-ordinates, or in (e.g.) side length, face area or volume of a cube, etc.
• recalling sec. 5.2, statistical consistency — whether for a properly specified model or a misspecified model — merely says that collecting extra data (as people seem very inclined to do) is a worthwhile activity. For other related arguments in support of Bayesianism and the Bayesian MML approach, recall Wallace’s notions of Bayesian bias [Wallace, 1996c, sec. 4.1] and false oracles [Wallace, 1996c, sec. 3; Dowe et al., 2007], Wallace’s intuitive but nonetheless important proof that sampling from the Bayesian posterior is a false oracle [Wallace, 1996c, sec. 3.4] and Wallace’s arguments that the Strict MML estimator (which is deterministic) approximates a false oracle [Wallace, 1996c, secs. 5–7]. For different arguments in support of Bayesianism and the Bayesian MML approach for those who like the notions of Turing machines and (Kolmogorov complexity or) algorithmic information theory from sec. 2.4, recall that (as per the end of sec. 2.4) the choice of (Universal) Turing Machine in algorithmic information theory is (obviously?) also a Bayesian choice [Wallace and Dowe, 1999a, secs. 2.4 and 7; 1999c, secs. 1–2; Comley and Dowe, 2005, p. 269, sec. 11.3.2; Dowe, 2008a, footnotes 211, 225 and (start of) 133, and sec. 0.2.7, p. 546; 2008b, p. 450].
6.6 Further MML work, such as MML Support Vector Machines
The relationship between MML and Kolmogorov complexity (or algorithmic information theory) [Wallace and Dowe, 1999a; Wallace, 2005, secs. 2.2–2.3] from sec. 2.4 means that MML can be applied universally across all inference problems and can even compare and contrast two models from very different families. Let us say something about the statistical learning theory of Vapnik and Chervonenkis, VC dimension, Support Vector Machines (SVMs) and Structural Risk Minimisation (SRM) [Vapnik, 1995] before discussing how this might be put in an MML framework. The statistical learning theory of Vapnik and Chervonenkis uses the notion of the Vapnik-Chervonenkis (VC) dimension of a set to give a classical non-Bayesian way of doing regression (typically using a technique called Structural Risk Minimisation [SRM]) and classification (typically using Support Vector Machines [SVMs]). Recalling the distinction between inference and prediction from sec. 4.1, both statistical learning theory and Akaike's Information Criterion (AIC) seem to be motivated by prediction, whereas MML is motivated by inference — a point noted in [Wallace, 1997] (ref. [281] in [Dowe, 2008a]). It is not clear to this writer what statistical learning theory advocates for (comparing) models from different families (e.g., polynomial vs exponential), or for decision trees (classification trees) (e.g., [Wallace and Patrick, 1993; Wallace, 2005, sec. 7.2]), where each split corresponds to a conjunctive "AND", discretely partitioning the data. It is even less clear what statistical learning theory will advocate for the generalisation of decision trees (classification trees) called decision graphs (classification graphs) [Oliver and Wallace, 1991; 1992; Oliver, 1993; Tan and Dowe, 2002; 2003], in which a disjunctive "OR" in the formula allows selected branches of the tree to join — making the model space more general, as now we have two
discrete operators (both split and join) in addition to various continuous-valued parameters (such as the multinomial class probabilities in the leaves). Problems where this writer is either not sure what statistical learning theory will advocate and/or where I suspect that it might advocate (and have to bear the statistical inconsistency of) Maximum Likelihood include those mentioned above and also the Neyman-Scott and Neyman-Scott panel data problems from secs. 6.4–6.5 and the "gappy" problem and the bus number problem of sec. 4.1. Whether or not my ignorance of how statistical learning theory will behave in certain situations is a shortcoming in statistical learning theory or in me is something for the reader to decide. Meanwhile, MML is universal — due to its relationship with Kolmogorov complexity (as per secs. 1 and 6.2) and likewise because it always has a message length as an objective function. The original treatments (e.g., [Vapnik, 1995]) of the Vapnik-Chervonenkis (VC) notion of statistical learning theory, support vector machines (SVMs) and structural risk minimisation (SRM) are not Bayesian. Efforts have been made to put the notions of Vapnik-Chervonenkis statistical learning theory into a Bayesian MML (or similar) framework (starting with [Vapnik, 1995, sec. 4.6]). At least one motivation for doing this is to be able to apply statistical learning theory to problems where it might not have otherwise been possible to do so. Fleshing out ideas hinted at in [Vapnik, 1995, sec. 4.6], MML has been applied to Support Vector Machines (SVMs) in [Tan and Dowe, 2004] (where we do not just have SVMs, but we also have decision trees — and, in fact, we have a hybrid model with SVMs in the leaves of decision trees), with discussions on alternative and refined coding schemes given in [Dowe, 2007; 2008a, footnote 53; 2008b, p. 444] including [Dowe, 2008a, footnote 53, fourth way, pp. 527–528; 2008b, p. 444] explicitly modelling the distribution of all the variables, including the input variables. It is MML's abovementioned relationship with Kolmogorov complexity (or algorithmic information theory) that enables us to consider alternative coding schemes. Explicitly modelling the distribution of all the variables (including the input variables) would amount to making generalized hybrid Bayesian network graphical models (as per sec. 7.6), some of whose properties are discussed in secs. 2.3 and 3.6. (Perhaps digressing, as per [Dowe, 2008a, footnote 56], [Rubinstein et al., 2007] might also be of some use here.) Staying with the Vapnik-Chervonenkis VC dimension but moving from SVMs to SRM, MML was compared with SRM for univariate polynomial regression in [Wallace, 1997]. See the discussion in [Dowe, 2008a, sec. 0.2.2, p. 528, col. 1, including also footnotes 57 and 58]. MML has also been applied (e.g.) to hierarchical classification in [Boulton and Wallace, 1973b; Dowe, 2008a, sec. 0.2.3, p. 531, col. 1 and sec. 0.2.4, p. 537, col. 2] (and elsewhere), with an application of hierarchical MML mixture modelling in [Wallace and Dale, 2005], to image recognition in [Torsello and Dowe, 2008b; 2008a], with other work on MML mixture modelling in sec. 7.6, and MML applied to James-Stein estimation in [Schmidt and Makalic, 2009b]. For some of many more examples, see also (e.g.) sec. 7.6 (in particular) and (in general) all of sec. 7.
6.7 A note on Minimum Description Length (MDL)
I forget how many times and how regularly I have been asked to summarise and/or highlight the similarities and differences between MML and the much later Minimum Description Length (MDL) principle. Because of this and also because of a request from at least one and possibly all of my referees, I include the current section. For want of somewhere to put it, I have placed it here, but the reader can probably safely skip from sec. 6.6 to sec. 6.8 with greater continuity and perhaps no great loss. Historically, the Minimum Description Length (MDL) principle [Rissanen, 1978] (following formative ideas in [Rissanen, 1976]) was first published 10 years, 6 journal papers [Wallace and Boulton, 1968, p. 185, sec. 2; Boulton and Wallace, 1969; 1970, p. 64, col. 1; 1973b, sec. 1, col. 1; 1975, sec. 1, col. 1; Wallace and Boulton, 1975, sec. 3] (it would be 7 journal papers if we were hypothetically to count [Boulton and Wallace, 1973a]), 1 Master's thesis [Boulton, 1970], at least one conference abstract [Boulton and Wallace, 1973c] and 1 PhD thesis [Boulton, 1975] after the seminal Wallace and Boulton MML paper [1968], including 3 years after the Wallace and Boulton [1975] paper introducing Strict MML (whose original publication was delayed as per sec. 6.1 and [Dowe, 2008a, sec. 0.1, p. 524, col. 1]). The ideas in MML of being Bayesian and of having a two-part message have been unwaveringly constant throughout since the original 1968 inception [Wallace and Boulton, 1968]. A variety of theoretical justifications for Bayesianism are given in (e.g.) sec. 6.5 and in [Wallace, 1996c, secs. 3 (especially 3.4), 4.1 and 5–7; Wallace and Dowe, 1999a, secs. 2.4 and 7; 1999c, secs. 1–2; Comley and Dowe, 2005, p. 269, sec. 11.3.2; Dowe et al., 2007; Dowe, 2008a, footnotes 211, 225 and (start of) 133, and sec. 0.2.7, p. 546; 2008b, p. 450]. A variety of theoretical justifications for the two-part form of the MML message are given in (e.g.) [Wallace and Freeman, 1987, p. 241; Barron and Cover, 1991; Wallace, 2005, sec. 3.4.5, p. 190, note use of "agrees"; Dowe et al., 2007, sec. 5.3.4]. The objectives — or at the least the way(s) of attempting to achieve the objectives — of the Minimum Description Length (MDL) principle would appear to have changed over the years since the first MDL paper in 1978 [Rissanen, 1978], where part of the motivation appears [Rissanen, 1978, p. 465] to be (algorithmic information theory or) Kolmogorov complexity, a term repeated in [Rissanen, 1999a, sec. 2, p. 261]. It is the prerogative of any scientist or any researcher to change and/or refine their ideas, and I attempt to survey various developments and changes in the presentations I have seen of MDL. Rissanen appears throughout his MDL works to want to avoid being Bayesian. This seems slightly curious to me for a few reasons. First, there are countably infinitely many Universal Turing Machines (UTMs) and, as per secs. 2.4 and 6.5, the choice of a UTM is a Bayesian choice. As such, in relating MDL to Kolmogorov complexity, it seems difficult not to relate MDL to Bayesianism. Second, although Rissanen does not seem to want to use a Bayesian prior, his Normalised Maximum Likelihood (NML) uses the Jeffreys "prior" [Rissanen, 1996; 1999a], an approach
one might facetiously call "Bayesian". The Jeffreys "prior" is not without issue — it is based on the data (thus apparently weakening the claimed relationship with Kolmogorov complexity), it doesn't always normalise [Wallace and Dowe, 1999b, sec. 2], and it will typically depend upon things which we would not expect to be overly relevant to our prior beliefs — namely, the strength and location of our measuring instruments [Dowe et al., 1996e, p. 217; Wallace and Dowe, 1999a, sec. 2.3.1; Comley and Dowe, 2005, sec. 11.4.3, p. 273]. There are also other concerns [Wallace and Freeman, 1987, sec. 1, p. 241; Wallace and Dowe, 1999a, sec. 5, p. 277, col. 2]. Efforts to normalise the Jeffreys prior by restricting its domain are said by other authors to be "unsatisfying" [Dawid, 1999, sec. 5, p. 325, col. 2] and would certainly appear to be reverting back to Bayesianism (rather than "Bayesianism"). Another opinion about the Bayesianism or otherwise in MDL is "... we see that Rissanen's approach is not incompatible with a Bayesian approach" [Clarke, 1999, sec. 2, p. 338, col. 2]. And while discussing the Jeffreys "prior" and Normalised Maximum Likelihood (NML) approach(es) [Rissanen, 1996; 1999a], it is worth inviting the reader to compare with the approximately contemporaneous PIC (Phillips Information Criterion, Posterior Information Criterion) [Phillips and Ploberger, 1996] and the much earlier and very similar Wallace-Freeman approximation [Wallace, 1984a; Wallace and Freeman, 1987] from equation (20) of no later than 1987 — see also the discussion in [Wallace, 2005, sec. 10.2.1]. The MDL notion of 'completing the code' (or complete coding) [Rissanen, 1996; Grünwald et al., 1998, sec. 4] seems to break down for a variety of relatively simple cases [Wallace and Dowe, 1999b, secs. 1.2 and 2.3] and would appear to contravene the convergence conditions of the two-part message form from which the results in [Barron and Cover, 1991] emanate, a variety of theoretical justifications for which are cited above. The latest versions of MDL seem to advocate using Normalised Maximum Likelihood (NML) to select the "model class" (see [Wallace and Dowe, 1999b, sec. 2.1] re issues of ambiguity here) and the order of the model but not to do the point estimation of the parameters. Given the issues with Maximum Likelihood of over-fitting and statistical inconsistency raised in secs. 4.1 and 5.2, we endorse the avoidance of Maximum Likelihood. But then Normalised Maximum Likelihood (NML) starts to look quite similar to the earlier Wallace-Freeman [1987] approximation for model order selection but without necessarily easily being able to advocate a point estimate. And, as in sec. 6.5, Grünwald and Langford [2004; 2007] have shown statistical inconsistency for various Bayesian estimators and various forms of the Minimum Description Length (MDL) principle (under model misspecification), but none of us are aware of any current evidence for a statistical inconsistency in MML [Grünwald and Langford, 2007, sec. 7.1.5; Dowe, 2008a, sec. 0.2.5]. (It is probably worth mentioning here an attack on MML [Grünwald et al., 1998] which was later retracted [Grünwald, 2007, sec. 17.4, An Apologetic Remark; Dowe, 2008a, sec. 0.2.4, p. 536].)
scheme based on some similar method. People should take as much care as they are able to here — [Comley and Dowe, 2005, sec. 11.4.3] gives plenty of examples of poor MDL-like coding schemes whose performances vastly improved when they were re-visited using MML [Wallace and Patrick, 1991; 1993; Viswanathan et al., 1999; Needham and Dowe, 2001]. (See also sec. 4.) One can make bad wine from good grapes, and a poor coding scheme will not do justice to the likes of MDL and MML. Despite the above challenges to and/or criticisms of much MDL work to date as it compares with earlier MML work, considering the issues which Rissanen raises — as he attempts to maintain all the niceties (of MML) while also attempting to avoid being Bayesian — and contemplating responses can certainly yield insights at the very least, and of course possibly much more. One such purported insight is described in part of sec. 7.1, a section in which objective Bayesianism is discussed. (Also worth mentioning here in passing is a way in which MDL could be re-visited, stating parameter estimates "using whatever 'code' or representation was used in the presentation of the raw data" [Wallace and Dowe, 1999b, sec. 3, p. 336, col. 2].) And, finally, re comparing MDL and MML, as per the final sentence of this section, I refer the reader to [Wallace and Dowe, 1999c, abstract]. Given that when we take logarithms base 2 (log₂) we typically refer to the unit as a bit, for some historical context on what to call the units when we take natural logarithms (base e, logₑ), see (e.g.) [Hodges, 1983, pp. 196–197] for early names (e.g., 'ban'), see (e.g.) [Boulton and Wallace, 1970, p. 63; Comley and Dowe, 2005, p. 271, sec. 11.4.1] re 'nit', and see (e.g.) much later MDL writings for the term 'nat'. Other treatments of this topic of contrasting MDL and MML are given in (e.g.) [Wallace, 1999; Wallace and Dowe, 1999b, sec. 3; Wallace, 2005, sec. 10.2; Comley and Dowe, 2005, sec. 11.4.3, pp. 272–273; Baxter and Oliver, 1995] and — perhaps most especially in summary — [Wallace and Dowe, 1999c, abstract].
6.8 Comparing "Right"/"Wrong" and Probabilistic scores
The original idea behind the notion of boosting was to more heavily (penalise or) weight incorrect answers in a decision tree (or classification tree) so as to grow the tree and ultimately have fewer errors — that is, right/wrong errors. Sec. 3.1 showed us that "right"/"wrong" scoring is not invariant to re-framing of questions, and sec. 3.2 re-iterated some recent results on the uniqueness of log(arithm)-loss scoring in being invariant to the re-framing of questions. This said, before we examine boosting more closely in sec. 6.9, we might ask what a good "right"/"wrong" score tells us about the log(arithm)-loss score and vice versa. By rights, in a multiple-choice question of c choices, even if c ≫ 2, giving the correct answer a probability of just less than 0.5 can still result in a higher probability of at least 0.5 being given to an incorrect answer and so a "right"/"wrong" score of "wrong" (or 0). So, regardless of the number of choices, c, an incorrect answer guarantees a score of at least 1 bit. Correspondingly, a score of less than 1 bit
guarantees a "right" answer. At the other extreme, it is possible that a probability of just over 1/c allocated to the correct answer will give us a "right"/"wrong" score of "right" (or 1). Correspondingly, a score of more than log(c) will surely give us a "right"/"wrong" score of "wrong" (or 0). So, if we have n multiple-choice questions of c_1, ..., c_i, ..., c_n options each, then a "right"/"wrong" score of 0 correct corresponds to a log(arithm)-loss cost of at least log(2^n) = n bits, and a "right"/"wrong" score of n correct (all correct) corresponds to a log(arithm)-loss cost of at most Σ_{i=1}^{n} log(c_i) = log(Π_{i=1}^{n} c_i). So, slightly paradoxically, on a quiz of 1 ternary (3-valued) question, someone (with probabilities {0.498, 0.501, 0.001}) might get a wrong answer for a minimum total of 0 "right" with a log-loss penalty score of just over log(2), whereas someone else (with probabilities {0.334, 0.333, 0.333}) might get a correct answer for a maximum total of 1 "right" but with a worse log-loss penalty score of just under log(3). I put it to the reader that the person with the better log-loss score actually has a better claim to having been correct on this question than the person given a score of 1 "right". And, of course, more emphatically, on a quiz of n questions, someone with a "right"/"wrong" score of 0 "right" might have a log(arithm)-loss penalty score of little over log(2^n), whereas someone who got (n − 1) out of n correct might have an arbitrarily large (or infinite) log(arithm)-loss penalty score by assigning an arbitrarily small (or zero) probability to the correct answer in the one question that this person got "wrong". (And, similarly, as at least implicitly pointed out in sec. 3.2, one can use boosting to make all sorts of guesses and predictions in data which is just random noise, and although much damage could be done to the log(arithm)-loss penalty score, no damage will be done to the "right"/"wrong" score.) We address this issue of getting good "right"/"wrong" scores without unduly damaging the log(arithm)-loss penalty score again in sec. 6.9.
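Before doing so, the ternary-quiz comparison above is easy to check. A minimal sketch (Python; the probability assignments are the ones in the text, with the correct option listed first):

    import math

    correct = 0                                        # index of the correct option
    entrants = {"confident":   [0.498, 0.501, 0.001],
                "uniform-ish": [0.334, 0.333, 0.333]}

    for name, p in entrants.items():
        chosen = max(range(len(p)), key=lambda i: p[i])
        right_wrong = int(chosen == correct)           # 1 = "right", 0 = "wrong"
        log_loss = -math.log2(p[correct])              # log-loss in bits
        print(f"{name:11s} right/wrong = {right_wrong}, log-loss = {log_loss:.3f} bits")

The respondent scored "wrong" pays just over 1 bit (log 2), while the respondent scored "right" pays just under log₂ 3 ≈ 1.585 bits, which is the slight paradox described above.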
6.9 Boosting
One suggestion of Chris Wallace's (in private communication) was that the right/wrong predictive accuracy of MML decision trees could be improved by going to each leaf in turn and doing one additional split beyond the MML split. I understand that this was the motivation behind the subsequent [Oliver and Hand, 1994]. Of course, as per sec. 4.1, optimal prediction is given not just by using the MML tree, but by combining several trees — ideally as many as possible — together [Oliver and Hand, 1996; Tan and Dowe, 2006]. However, given our new apparent uniqueness results for probabilistic log(arithm)-loss scoring from [Dowe, 2008a, footnote 175 (and 176); 2008b, pp. 437–438] and sec. 3.2, it perhaps makes more sense to carefully focus on improving the probabilistic log(arithm)-loss score. One option is to do MML inference but with the "boosting priors" from [Dowe, 2008a, sec. 0.2.6]. The idea behind these "boosting priors" is that, rather than fix our Beta/Dirichlet prior [Wallace, 2005, p. 47 and sec. 5.4] to have α = 1, we
can try "boosting priors", whose rough form [Tan and Dowe, 2006, sec. 3.4, p. 598; Dowe, 2008a, sec. 0.2.6] on α could be, e.g., √3/(2√α (1 + α)⁴) or (e^(−α/π))/(π√α). The idea is simply to retain a mean of (approximately) 1 but to have a large spike near α = 0, which in turn increases our propensity to have pure classes. Another option is to re-visit boosting and to think of it not as in its original form of minimising the number of right/wrong errors but rather instead in the similar form of trying to optimise the expected predictive score. This modification to boosting is related to the original form of boosting in that each individual (right/wrong) mistake will typically correspond to a poor right/wrong (yes/no) score. The predictive score should not be done using the log-likelihood, but rather should be done using the minimum expected Kullback-Leibler (minEKL) probability estimate [Dowe et al., 1998; Wallace, 2005, secs. 4.7–4.9; Dowe et al., 2007, secs. 4 and 6.1.4; Dowe, 2008a, sec. 0.2.2; 2008b, p. 444] from sec. 4.1. In other words, if there are m classes and a given leaf has counts c_1, ..., c_i, ..., c_m for the m classes and C = Σ_{i=1}^{m} c_i, Maximum Likelihood would advocate a probability estimate of p̂_i,MaxLhood = c_i/C for each class. However, if we think of α_i from a Dirichlet prior as denoting a "pre-count" in class i (before any data is actually counted), then the probability in each class can be regarded as p̂_i = (c_i + α_i)/(C + Σ_{j=1}^{m} α_j). Of course, we can set α_i = α for each class and then use the so-called "boosting priors" on α as per [Tan and Dowe, 2006, sec. 3.4, p. 598; Dowe, 2008a, sec. 0.2.6]. Let us finish with three further comments. First, by way of digression, an attempt to give "objective" ways of choosing α is given in sec. 7.1. Second, for those who wish to boost to improve the right/wrong score or simply wish to get a good to excellent right/wrong score, given the (apparent uniqueness of) invariance of the log-loss score, we make the simple recommendation that predictors that give good right/wrong scores be checked so that they also give a good log-loss score — this might involve moving extreme probabilities away from the extremities of 0 and 1 (such as can arise from using Maximum Likelihood). (Some possible estimators for doing this are given in, e.g., [Wallace, 2005, secs. 4.8 and 5.4]. A prediction method which is good enough to genuinely get a good "right"/"wrong" score can surely be gently modified or toned down to give a good log-loss score.) As a third comment, for those not wishing to sacrifice statistical consistency in their efforts to improve predictive accuracy, it might be worth considering comments [Dowe, 2008a, footnote 130] about potential dangers of placing too much weight on the likelihood of the data.
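A minimal sketch (Python; the leaf counts are made-up) of the pseudo-count estimate p̂_i = (c_i + α)/(C + mα) just described, contrasted with Maximum Likelihood and with the minEKL (Laplace) case α = 1:

    counts = [7, 3, 0]                       # hypothetical class counts in one leaf
    C, m = sum(counts), len(counts)

    def class_probs(alpha):
        # p_i = (c_i + alpha) / (C + m*alpha): alpha = 0 is Maximum Likelihood;
        # alpha = 1 is the minEKL (Laplace) estimate for the multinomial.
        return [(c + alpha) / (C + m * alpha) for c in counts]

    print("Maximum Likelihood (alpha = 0):  ", class_probs(0.0))   # an extreme 0.0
    print("minEKL / Laplace   (alpha = 1):  ", class_probs(1.0))
    print("near-pure classes  (alpha = 0.05):", class_probs(0.05))

A "boosting prior" does not fix α but places a spike of prior probability near α = 0 (while keeping the mean near 1), increasing the propensity to infer pure classes while still keeping the estimated probabilities away from the extremes of 0 and 1.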
7 MML AND SOME APPLICATIONS IN PHILOSOPHY AND ELSEWHERE
MML has been applied to a variety of problems in philosophy — including (e.g.) the philosophy of science [Dowe and Oppy, 2001], the philosophy of statistics and inference [Dowe et al., 2007], the philosophy of mind (see, e.g., sec. 7.3), the philosophy of language and the philosophy of religion. We mention these and some other — mainly philosophical — issues here.
7.1 Objective Bayesianism (and Bertrand's paradox) and some new invariant "objective priors"
Bertrand's paradox is essentially concerned with the issue that we cannot choose a uniform Bayesian prior in all parameterisations. Certainly, many authors would like an objective form of Bayesianism — or, equivalently, a parameterisation in which our Bayesian prior can be uniform. Recalling the notion of Universal Turing Machine (UTM) from sec. 2.4, one can claim that a simplest UTM is one with the smallest product of its number of states and its number of symbols (as this is the number of rows in the instruction table) [Dowe et al., to appear (b)]. Simplest UTMs have been used in inference [Wallace, 2005, sec. 2.3.12; Gammerman and Vovk, 2007a; 2007b; Martin-Löf, 1966; Dowe, 2007], and they are one way of attempting to be objective (while still being Bayesian) and — as a consequence — side-stepping Bertrand's paradox. Of course, such objectivity — where possible — will potentially be useful in places such as legal battles [Dowe, 2008a, pp. 438–439]. Much good and interesting work has been done in the area of objective Bayesianism by (e.g.) J. M. Bernardo [Bernardo and Smith, 1994] and others. Above, we follow Wallace in offering simplest UTMs as objective priors. Below, as per the end of sec. 6.7, we now change tack and re-visit the Jeffreys "prior" (whose use, incidentally, was not advocated by Jeffreys [1946]) as an objective invariant prior. Some things that we know to be invariant upon re-parameterisation include the likelihood function, Maximum Likelihood, the marginal probability r(x_i) = ∫ h(θ)f(x_i|θ) dθ of datum x_i (from sec. 6.1), the message length and many of its variants (from secs. 6–6.3), Minimum Message Length (when using an invariant form), the Fisher information and (recalling secs. 4.1 and 6.9) the minimum Expected Kullback-Leibler divergence (minEKL) estimator. Given these invariant building blocks, we now take the Jeffreys "prior" (which we recall from sec. 6.7 and pointers therein does not always normalise), and construct a family of other invariant priors. To kick off with an example, let us take an (M-state, (M − 1)-dimensional, M ≥ 2) multinomial distribution — as per [Wallace, 2005, sec. 5.4] — with prior of the form Const · p_1^{α_1} · · · p_M^{α_M} (where we use α_i where Wallace [2005, sec. 5.4] writes α_i − 1). We will have counts s_i (i = 1, ..., M) and we let N = Σ_{i=1}^{M} s_i and A = Σ_{i=1}^{M} α_i. The Wallace-Freeman (1987) MML estimate from equation (20) is (p̂_i)_{MML WF1987} = (s_i + α_i + 1/2)/(N + A + M/2). And, recalling sec. 6.7, the minEKL (or Laplace) estimate (equivalently here, the posterior mean [Boulton and Wallace, 1969]) is (p̂_i)_{minEKL} = (s_i + α_i + 1)/(N + A + M). As such, we observe that the Wallace-Freeman [1987] MML estimate from equation (20) with the Jeffreys "prior" (α_i = −1/2) gives the Maximum Likelihood estimate. We similarly observe that the Wallace-Freeman estimate with the uniform prior (α_i = 0) is equivalent to getting the minEKL estimate with the Jeffreys "prior" (α_i = −1/2). In this particular case of the multinomial distribution, we note that we can
transform from the (invariant) Wallace-Freeman [1987] MML estimate to (invariant) minEKL by adding 1/2 to the α_i. As such, if the Jeffreys prior h_0 = h_Jeffreys = h_FisherInfo (with α_i = −1/2) is to be called objective, then a case can be made that so, too, is the uniform prior h_1 (with α_i = 0). We can iterate again to get further invariant priors: h_2 (with α_i = 1/2), h_3 (with α_i = 1), etc. One could also iterate in the opposite direction: h_{−1} (with α_i = −1), h_{−2} (with α_i = −3/2), etc. All such priors — or at least those which normalise — are invariant (by construction) and can be regarded in some sense as "objective". One could then choose the prior h_j for the smallest value of j for which h_j normalises. This method of using the Jeffreys "prior" to generate further invariant objective priors (via invariant transformations) and then taking the "first" to normalise certainly generalises — well beyond the above example of the multinomial distribution — to other distributions. In general, for some given distribution, start again with h_0 = h_Jeffreys = h_FisherInfo and then, given prior h_i, let h_{i+1} be the prior such that (some invariant form of) the MML estimate with prior h_{i+1} is as close as possible in Kullback-Leibler distance (and ideally equal) to the minEKL estimate with prior h_i. With however much ease or difficulty, we can then generate this sequence of invariant priors h_0 = h_Jeffreys = h_FisherInfo, h_1, h_2, ... and perhaps also h_{−1}, h_{−2}, etc. (As a general rule, because of MML's tendency to fit just right and minEKL's tendency to under-fit as per sec. 4.1, we expect to see a corresponding progression in this sequence of priors — as is perhaps best seen from the above example with the multinomial distribution. In that case, h_i has α_i = (i − 1)/2, meaning that, as i increases, it takes increasingly much data to move the estimate away from the centroid where p̂_i = 1/M for each i.) If there does exist some smallest j for which h_j normalises, a case could be made that this is an objective invariant prior which might be more suitable than the Jeffreys "prior", h_0. Penultimately, cases could be made for investigating combining two such priors, as in considering (e.g.) h_hybrid = √(h_{j₁} h_{j₂}). Cases could also be made for attempting to allow j not to be an integer but rather somehow to be fractional. We will not investigate these here. And, finally, returning to issues from sec. 6.7, it would perhaps be nice if Normalised Maximum Likelihood (NML) could be re-visited but with use of one of these alternative invariant priors to the Jeffreys "prior", h_0. This should retain the statistical invariance but might reduce some of the vulnerabilities (such as over-fitting and statistical inconsistency, both mentioned earlier) associated with Maximum Likelihood.
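The multinomial example above can be tabulated directly. A minimal sketch (Python; the counts s_i are illustrative) which prints the Wallace-Freeman and minEKL estimates along the sequence of priors h_j (where h_j has α_i = (j − 1)/2, so h_0 is the Jeffreys "prior") and checks the shift-by-1/2 relationship noted above:

    s = [3, 1, 0]                              # illustrative multinomial counts s_i
    N, M = sum(s), len(s)

    def wf(alpha):                             # Wallace-Freeman (1987) MML estimate
        A = M * alpha
        return [(si + alpha + 0.5) / (N + A + M / 2) for si in s]

    def minekl(alpha):                         # minEKL (Laplace / posterior-mean) estimate
        A = M * alpha
        return [(si + alpha + 1.0) / (N + A + M) for si in s]

    for j in range(0, 4):
        alpha = (j - 1) / 2.0                  # h_j has alpha_i = (j - 1)/2
        print(f"h_{j} (alpha = {alpha:+.1f}): WF",
              [round(p, 3) for p in wf(alpha)],
              "| minEKL", [round(p, 3) for p in minekl(alpha)])

    # The add-1/2 shift: WF under h_1 (uniform) coincides with minEKL under h_0 (Jeffreys).
    assert wf(0.0) == minekl(-0.5)

As α increases along the sequence, both estimates move towards the centroid p̂_i = 1/M and it takes more data to move them away from it, the progression remarked upon above.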
7.2 Goodman's "Grue" paradox (and choice of language)
Nelson Goodman’s “grue” paradox raises the issue of why notions like “green” and “blue” should be more natural than notions of “grue” (green before time t0 [say year 3000] and blue thereafter) and “bleen” (blue before time t0 [say year 3000] and green thereafter). This has been discussed in the Solomonoff predictive
and Wallace MML inductive frameworks, with relevant writings being [Solomonoff, 1996; 1997b, sec. 5; Comley and Dowe, 2005, sec. 11.4.4; Dowe, 2008a, footnotes 128, 184 and 227]. Among other things, an adequate solution of when to arrive at a notion like "grue" and when to arrive at a notion like "green" (which is, after all, grue before time t0 and bleen thereafter) is presumably necessary when trying to evolve language (for those beings not yet with language) or when trying to communicate with non-human terrestrials or extra-terrestrials [Dowe, 2008a, footnote 184]. Wallace's approach from [Dowe, 2008a, footnote 128], elaborating upon [Comley and Dowe, 2005, sec. 11.4.4], was summarised as follows: "Suppose someone is growing and harvesting crops, commencing (much) before t0 and finishing (much) after t0. We expect the grass and certain moulds to be green, and we expect the sky and certain weeds to be blue. The notions of grue and bleen here offer at most little in return other than sometimes to require (time-based) qualification and to make the language sometimes unnecessarily cumbersome." This said, there are times of event changes which can be of interest. If t0 were the time of the next expected reversal of the earth's magnetic field, then in talking on such a time-scale we have reason to disambiguate between magnetic north and geographic north in our language — as these notions are approximately equal before t0 and approximately antipodal (for at least some time) after t0 [Dowe, 2008a, footnote 128]. But the terms 'grue' and 'bleen' cost us but seem to gain us nothing. By and large, languages will develop, import, qualify and/or abbreviate terms when these terms warrant (sufficient) use. And, while on that very issue of abbreviation, the reader will note at least one place in this article where we have written "Minimum Message Length (MML)". This convention of putting an acronym or other abbreviation in brackets immediately after the term it abbreviates enables us to use the abbreviation (rather than the full term) elsewhere — thus enabling us to shorten the length of our message. And, digressing, while on the earlier issue of languages, MML has been used to model evolution of languages [Ooi and Dowe, 2005; Dowe, 2008a, sec. 0.2.4; 2008b, p. 455] (not to mention finite state automata [Wallace and Georgeff, 1983] and DNA string alignment [Allison et al., 1990a; 1990b; 1990; 1991; 1992a; 1992b; Allison and Wallace, 1993, 1994a, 1994b]). An able philosopher colleague, Toby Handfield, has told me in private communication — while discussing Lewis [1976] and the "laws of nature" — that if MML were able to recognise a number constructed as (say) the sum without carries of e (the base of natural logarithms) expanded in hexadecimal (base 16) and π expanded in decimal (base 10), then this would go a long way towards convincing him that MML can solve Goodman's grue paradox. Using the relationship between MML and (algorithmic information theory or) Kolmogorov complexity [Wallace and Dowe, 1999a; Wallace, 2005, chaps. 2–3] discussed in sec. 6, we outline the argument below. In short, MML will have no difficulty with doing this (in principle) — the caveat being that the search might take quite some time. We can specify e as Σ_{i=0}^{∞} 1/i! in a (Turing machine) program of length P_e, and we can specify the hth hex(adecimal) digit of e in a program of length P_e + C_1 + l(h)
for some constant C_1, where l(h) is the length of some prefix code (recall sec. 2.2) over the positive integers (e.g., the unary code from sec. 2.4). We could use a code of length 1 for h = 1 and of length ≤ 1 + ⌈1 + log₂(h) + 2 log₂(log₂(h))⌉ < 2 + 1 + log₂(h) + 2 log₂(log₂(h)) for h ≥ 2. Similarly, we can specify π as (e.g.) Σ_{i=0}^{∞} (4(−1)^i)/(2i + 1) in a (Turing machine) program of length P_π, and we can specify the hth hex(adecimal) digit of π in a program of length P_π + C_2 + l(h) for some constant C_2, where l(h) is as above. The program for addition without carry/ies simply entails addition without carry (or modulo addition) in each place, h, for h = 1, 2, 3, 4, .... So, for the hth hex(adecimal) digit, we can say that the hth hex digit, composite_h, of our composite number is given as follows:

    if   (e_{h, 16} + pi_{h, 10} <= 15)
    then composite_h = e_{h, 16} + pi_{h, 10}
    else composite_h = e_{h, 16} + pi_{h, 10} - 16;
Given that this is how the composite number is being generated, given sufficiently many hex digits of this number, the Minimum Message Length (MML) inference will be the algorithm for generating this composite number. Again, the search might be slow, but this will be found. We can actually take this further by randomly adding noise. Let us suppose that, with probability p, hex digit h comes from some probability distribution (q_1, q_2, ..., q_14, q_15, q_16 = 1 − Σ_{i=1}^{15} q_i) and with probability 1 − p this hth hex digit will be composite_h. So, Pr(CompositeWithNoise_h = composite_h) = p·q_{composite_h} + (1 − p). For each i ≠ composite_h, Pr(CompositeWithNoise_h = i) = p·q_i. In the case that p = 0, this reduces to the noiseless case. Here, the search will be even slower, but with sufficiently many digits and with sufficient search time, we will converge upon the noiseless program above generating the digits in addition to having an increasingly good quantification of the noise.
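A minimal generator for the composite sequence just described (Python with mpmath for the digit expansions; the helper names are mine, and the optional noise step follows the probabilistic model above):

    import random
    from mpmath import mp, e, pi, floor

    mp.dps = 60                       # working precision: plenty for 20 digits

    def digit(value, h, base):
        # h-th digit (h = 1, 2, ...) of value's fractional expansion in the given base.
        return int(floor(value * base ** h)) % base

    def composite(h):
        # Sum without carry of e's h-th hexadecimal digit and pi's h-th decimal
        # digit, reduced modulo 16 (the if/then/else above).
        return (digit(e, h, 16) + digit(pi, h, 10)) % 16

    def composite_with_noise(h, p=0.1, q=None, rng=random.Random(1)):
        # With probability p, replace the digit by a draw from distribution q over 0..15.
        q = q if q is not None else [1.0 / 16] * 16
        return rng.choices(range(16), weights=q)[0] if rng.random() < p else composite(h)

    print([composite(h) for h in range(1, 21)])

Given enough such digits, a two-part search would, as argued above, eventually prefer this generating program (plus an estimate of the noise) over any ad hoc encoding of the digit sequence.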
7.3 MML, inductive inference, explanation and intelligence
As intimated in sec. 1, MML gives us the inductive inference (or induction, or inference, or explanation) part of intelligence [Dowe and Hajek, 1997; 1998, especially sec. 2 (and its title) and sec. 4; Sanghi and Dowe, 2003, sec. 5.2]. And Ockham's razor tells us that we should expect to improve on Searle's "Chinese room" look-up table [Searle, 1980] by having a compressed representation — as per our commonsense intuition and arguments in [Dowe and Hajek, 1997, sec. 5.1 and elsewhere; 1997, p. 105, sec. 5 and elsewhere; Dowe, 2008a, footnote 182 and surrounding text] and sec. 4. Let us consider an assertion by Hutter [Legg and Hutter, 2007] that compression is equivalent to (artificial) intelligence (although subsequent work by Hutter now seems instead to equate intelligence with a weighted sum of reward scores across different environments). This assertion is later than a similar idea of Hernández-Orallo [Hernández-Orallo and Minaya-Collado, 1998; Hernández-Orallo, 2000]. It
is also stronger than an earlier idea [Dowe and Hajek, 1997; 1998, especially sec. 2 (and its title) and sec. 4; Sanghi and Dowe, 2003, sec. 5.2 and elsewhere; Dowe, 2008a, sec. 0.2.5, p. 542, col. 2 and sec. 0.2.7, p. 545, col. 1] that (the part of intelligence which is) inductive inference (or inductive learning) is equivalent to (two-part) compression. Let us look at the two issues separately of

• (i) first, whether all of (artificial) intelligence or perhaps just inductive inference is equivalent to (two-part) compression, and

• (ii) second, whether it is satisfactory simply to talk about (one-part) compression or whether we should insist upon two-part compression.

First, the components of intelligence would appear to include (at least) memory, deductive inference, inductive inference and ability to receive direct instruction. (By deductive inference, we mean and include mathematical calculations and logical reasoning, such as modus ponens — Socrates is a man, all men are mortal, therefore Socrates is mortal. To illustrate the distinction with an example, inductive inference is more along the lines of: all men are mortal, Socrates is mortal, therefore we assert some probability that Socrates is a man.) We need memory to store observations for making inductive inferences, for remembering inductive inferences, for remembering our progress through mathematical calculations or other (logical) deductions and for remembering those direct instructions (perhaps the deductions or inferences of others) that we receive. For example, a good human player of a game where the search space is too vast to be exhaustively searched (like chess or Go) needs inductive inference and direct instruction to help with an evaluation function (such as, in chess, the advantages of passed pawns, the weaknesses of isolated and backward pawns, and the approximate equivalence between a queen and three minor pieces), memory to remember these, memory to remember the rules, and deduction (and memory again) to do the lookahead calculations in the search tree. It is clear that all these aspects of intelligence are useful to a human player of such a game. However, to the mathematician, the logician, and especially a mathematician or a logician checking the validity of a proof (or someone double-checking that a Sudoku solution is correct), the main forms of intelligence required would surely appear to be deduction and memory. It is fair to say that the harder aspects of inductive learning and (two-part) compression also require memory and deductive inference. And we have argued elsewhere that we are more likely to attribute intelligence to someone performing an act of great memory if they have done this using a compressed representation [Dowe and Hajek, 1997; 1998]. But we ask the reader whether we should not attribute intelligence to the chess player or the mathematician (or the person checking a Sudoku solution) when performing (difficult) activities involving at most little inductive inference. Second, in many cases, doing straight (one-part) compression rather than two-part compression can lead to an incorrect model (as in the statistical inconsistency of the minEKL estimator from sec. 4.1 for the "gappy" problem mentioned in sec. 4.1 and for the Neyman-Scott problem in sec. 6.4) — and this remains true asymptotically regardless of how much data we have.
As per [Dowe, 2008a, sec. 0.2.7, p. 545, col. 1], I have discussed with J. Hernández-Orallo the notion of quantifying the intelligence of a system of agents and endeavouring to quantify how much of this comes from the individual agents (in isolation) and how much comes from their communication. Let us try to take this further in a couple of different (related) ways. First, it would be good to (artificially) evolve such a communal intelligence, including (perhaps inevitably) evolving a language. (As a tiny step, one of my 4th year Honours project students in 2009, Jeffrey R. Parsons, has made slight progress in evolving Mealy and/or Moore machines with the message length as a guiding fitness function. I do not wish to overplay his current progress, but it is in a useful direction.) And, second, re the topics of swarm intelligence and ant colony optimisation, perhaps only a very small range of parameter values (where the parameters describe the individual agents and/or their communication) permit the different parts to interact as an "intelligent" whole. This raises a couple of further issues: the issue of using MML to analyse data (as per sec. 7.6 and [Dowe, 2008a, sec. 0.2.7, p. 545]) and infer the parameter values (or setting) giving the greatest communal intelligence, and the additional issue(s) of whether or not greater prior weight should be given to those systems giving the interesting outcome of intelligence, and (similarly) — in the fine tuning argument of sec. 7.7 and [Dowe et al., to appear (b)] — whether greater prior probability should be given to parameter settings in which interesting universes (like our own) result. Having mentioned here the issues of intelligence, non-human intelligence and communication, it is worth mentioning some of Chris Wallace's comments about trying to communicate with an alien intelligence [Dowe, 2008a, sec. 0.2.5, p. 542, col. 2, and also footnote 184 and perhaps text around footnote 200] (and possibly also worth recalling Goodman's notion of "grue" from sec. 7.2 and [Dowe, 2008a, footnote 128]). We conclude here by saying that further discussion on some of the topics in this sub-section will appear in [Hernández-Orallo and Dowe, 2010].
7.4 (So-called) Causality
Chris Wallace did much work on “causal nets” using MML, including doing the (MML) mathematics and writing the software behind several papers on this topic [Wallace and Korb, 1994; Wallace, 1996b; Wallace et al., 1996a; 1996b; Dai et al., 1996a; 1996b; 1997a; 1997b; Wallace and Korb, 1997; 1999; Korb and Wallace, 1997; 1999] (with the possible exception of [Neil et al., 1999a; 1999b]). I have no objection to the quality of Wallace’s MML statistical inference — from the available data — in this work. Indeed, I have (at most little or) nothing but the highest praise for it. However, there are at least two or so matters about which one should express caution when interpreting the results from such inference. One issue is that getting the wrong statistical model (because we didn’t have enough data, we hadn’t searched thoroughly enough and/or our statistical inference method was sub-optimal) can lead to having arrows pointing the wrong way.
And even if we did have enough data, our statistical inference method was ideal and we searched thoroughly, it could still be the case that the true (underlying) model (from which the data has been generated) is outside the family of models that we are considering — e.g., our model family might be restricted to linear regressions (on the parent “explanatory” variables to the “target” child variable) with Gaussian noise while the real data-generating process might be more complicated. In such cases, slightly modifying the family of models being considered might change the directions of arrows in the inference, suggesting that the directions of these arrows should not all be regarded as directly “causal” [Dowe, 2008a, footnote 169]. As a related case in point, we might have data of (at least) two variables, including (i) the height/co-ordinates of a (weather) (monitoring) station and (ii) its (air) pressure reading. Our best statistical model might have arrows from pressure reading to height of monitoring station, but we surely shouldn’t interpret this arrow as being in any way causal.

Of course, temporal knowledge (of the order in which things occur) is also important for attributing causality. Whether or not this is well-known and well-documented (and please pardon my medical ignorance, as per sec. 3.2), there would appear to be a substantial overlap between cancer patients and stroke patients. Let’s suppose that in many cases the patient has a stroke and then cancer is detected some months later. It could appear that the stroke caused the cancer, but it is perhaps more probable that cancer-induced changes in the tissue and/or the bloodstream caused the stroke — even if the primary cancer was not in the brain and the metastatic cancer did not present in the brain until after the stroke. If this is all true, then it would suggest that the actual mechanism is that the Cancer is causing the Stroke — despite the possibility that an analysis of the data might easily lead one to conclude that the Stroke is causing the (Brain) Cancer.

We also have to be careful about issues such as (e.g.) hidden (or unknown) variables. As an example, a hidden latent variable might cause both A (which takes place slightly before B) and B. B might do the same exam paper (of mathematical calculations) as A but starting and finishing slightly later. Or perhaps B is a newspaper which goes to print after newspaper A goes to print but before newspaper A appears on the stands. We expect B’s answers and stories to be very similar to A’s, but this is because A and B have common (hidden) causes; and it seems loose to say that A causes B.

As another example, standing on the podium of a Grand Prix tends to greatly increase one’s chances of winning at a subsequent Grand Prix event. But this wouldn’t be true of one of the scaffolding constructors who tested the podium before the race ceremony, and nor would it be true of some overly exuberant spectator who managed to somehow get access to the podium. Rather, there is a (not very) hidden cause of ability causing someone to do well in two races, and doing well in the first of these races caused that racer to stand on the podium at its end.
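The hidden-common-cause point can be seen in a few lines of simulation (my own toy construction, not from the chapter): a latent variable L drives both A and B, with A merely observed slightly earlier, and the resulting strong correlation would tempt a naive analysis into drawing an arrow from A to B.

    import numpy as np

    rng = np.random.default_rng(1)
    L = rng.normal(size=10_000)             # hidden common cause (e.g., ability)
    A = L + 0.3 * rng.normal(size=L.size)   # observed first
    B = L + 0.3 * rng.normal(size=L.size)   # observed slightly later
    print(np.corrcoef(A, B)[0, 1])          # ~0.9, yet there is no A -> B link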
Lecturers (at university, college, or wherever), tutors, teachers and instructors who are able to give lectures without notes (of whom Chris Wallace is but one notable example) often give excellent lectures. Let us assume that this is the norm for such people. The cause of the good lecturing is surely the excellent memory and strong command of the subject of the lecturer, rather than any benefit that the average person might supposedly gain by trying to lecture without notes.

As another example, if A causes B and B causes C and we only know about A and C but have not yet even conceived of B (and, of course, we might be open to the existence of B but simply don’t know), then I think we can say A “causations” C but we have to be careful about saying that (e.g., supposedly) A causes C. A specific case might be the old example of A being living alone and C being having (few or) no rodents at home. B is the owning of a pet cat — the single person keeps the pet for company (and has no flat-mate or house-mate to complain), and the cat keeps the rodents at bay. Possibly see also [Dowe, 2008a, sec. 0.2.7, pp. 543–544] re LNPPP and causality.
7.5 Elusive model paradox (and encryption)
Gödel’s incompleteness theorem consists of constructing a mathematical statement which can be interpreted as saying that “This statement is not provable” [Gödel, 1931]. Clearly, this statement can’t be false, or it would be provable and hence true, leading to a logical contradiction. Hence, the statement must be both true (of the natural numbers) and not provable.

The original version of the elusive model paradox gives us a sequence where the next number is one (or unity) more than what we would expect it to be [Dowe, 2008a, footnote 211]. The subsequent version of the paradox essentially takes modulo 2 (so that even numbers are transformed to 0 and odd numbers are transformed to 1) and then gives us a binary sequence (or bit string) (of 0s and 1s) in which we can (paradoxically) be sure that the next bit is not the bit that we expect (or would have predicted) based on what we have seen so far (before it). This leads to a contradiction from which the only escape would appear to be the undecidability of the Halting problem (or Entscheidungsproblem), the notion that there are many calculations which will never terminate but for which we can never know that they will not terminate [Turing, 1936].

Whether one takes the elusive model paradox as being over a sequence of (increasing) positive integers (as per the original version [Dowe, 2008a, footnote 211]) or over a binary bit string sequence of 0s and 1s (as per the later version [Dowe, 2008b, p. 455]), each of these versions in turn can be thought of in two (essentially) equivalent ways. One of these ways is to play this as a game, where we have one agent (which can be represented by a Turing machine) generating the sequence and a group of one or more agents (which can also be represented by a Turing machine) trying to guess the next bit — while the (Turing machine) agent generating the sequence is attempting to generate the opposite bit to what (the Turing machine representing) those guessing will guess. (It might possibly help to
think of the generating agent as a soccer player taking a penalty kick, trying to kick the ball where the goalie won’t be — or as a tennis player trying to serve the ball to where the receiver won’t be; and, in turn, the guessing agent as the goalie trying to guess the location of the kick or the tennis receiver trying to anticipate the serve.)

To give both the generating Turing machine agent and the guessing Turing machine agent the best chances to do their respective jobs properly, we will assume that — recalling sec. 2.4 — these Turing machines are universal. As such, among other things, there will be finite emulation programs (or translation programs) causing one machine to emulate the other, and vice versa. As the generating program and the guessing program start out on small sequences being the early short initial segments of the generated bits and the guessed bits, the programs will quite possibly have different models of the data. But, as the sequences get longer and longer — after they become at least kilobits, megabits, gigabits, terabits, etc. long and vastly longer than the abovementioned translation programs — the models that these two UTMs have of the available data will start to converge. The guessing UTM will have had a very good look at the generating UTM and — given that the generating UTM is a finite deterministic machine — the guessing UTM would appear to be able at some stage to lock in on the behaviour of the generating UTM, thereafter guessing all subsequent bits correctly. Similarly, the generating UTM would appear to be able at some stage to lock in on the behaviour of the guessing UTM, thereafter anticipating all its guesses and then flipping the bit before generating it. After both these stages have occurred, we have the contradiction that the guessing UTM always guesses correctly while the generating UTM anticipates the guess, flips the bit that it knows will be guessed and so ensures that all subsequent guesses are incorrect. The Halting problem gets us out of this paradox (and seems like the only way out), as both the generating UTM and the guessing UTM can, and very often will, want more time before they are content that they have modelled the other correctly.

The second (essentially) equivalent way of thinking of the elusive model paradox is simply that the generating UTM agent and the guessing UTM agent are the same — as at the end of the previous paragraph. After starting off the sequence, we guess which bit should most probably come next, and then generate the bit which is least probable to come next — and then continue this indefinitely. We get (essentially) the same paradox, and again the Halting problem seems like the only way out of the paradox.

The above all said by way of introduction, we now present some variations on the elusive model paradox [Dowe, 2008a, footnote 211; 2008b, p. 455], including — recalling sec. 4.1 — one using inference and one using prediction. (Recall that inference uses the single best model whereas prediction weights over all available models.) One variation is that we can restrict ourselves to multinomial Markov models where the n-th order Markov model has (a maximum of) 2^n binomial distributions. Let j = j_m = j_m(i) ≤ i be some unbounded non-decreasing computable function of i. At step i, having bits b_1, b_2, ..., b_i, we choose b_(i+1) as follows, from the following two similar but (slightly) different methods — noting that both these constructions
are computable.

Method 1 (inference — using restricted “memory”): We infer the best (MML) Markov model of order ≤ j_m based on b_1, b_2, ..., b_i. We then use the predictive distribution from this MML inference to give a probability distribution for b_(i+1). We then choose b_(i+1) to be the bit with the least predicted probability.

Method 2 (prediction — using restricted “memory”): We use Bayesian model averaging over all the Markov models of order ≤ i to get a predictive probability distribution over b_(i+1). Again, we choose b_(i+1) to be the bit which has the lowest predicted probability. (A small illustrative sketch of Method 2 appears at the end of this subsection.)

With both of these methods — method 1 (inference) and method 2 (prediction) — the resultant sequence is “random” in the sense that no Markov model of finite order is going to be able to compress it. And this is so because the construction of the sequence is to destroy any such structure at the first viable opportunity upon its detection. Recall that both these constructions immediately above based on restricting “memory” are computable. Two (or more) alternative computable constructions — based on restricting computation time rather than “memory” — are given below. Let j = j_t = j_t(i) > i be some strictly increasing computable function of i.

Method 3 (inference — with restricted computation time): We infer the best (Minimum Message Length [MML]) inference from all computable functions (that we search over) within a search time of ≤ j_t based on b_1, b_2, ..., b_i. As in method 1, we then use the predictive distribution from this MML inference to give a probability distribution for b_(i+1). We then choose b_(i+1) to be the bit with the least predicted probability.

Method(s) 4 (prediction — with restricted computation time): We use Bayesian model averaging. There are two ways of proceeding further in restricted finite computation time — method 4(a) and (with tighter restriction) method 4(b).

Method 4(a): We use Bayesian model averaging over all the Markov models of order ≤ i to get a predictive probability distribution over b_(i+1). Here, a time restriction of ≤ j_t is applied to each of the individual Markov models in turn. They are then averaged as per Method 2. Again, we choose b_(i+1) to be the bit which has the lowest predicted probability.

Method 4(b): We use Bayesian model averaging over all the Markov models of order ≤ i to get a predictive probability distribution over b_(i+1). But, here, the time restriction of ≤ j_t is tighter in that it is applied to the entire calculation, including the final Bayesian model averaging. And, yet again, we choose b_(i+1) to be the bit which has the lowest predicted probability.

We might refer to these various sequences emerging from variations of our elusive model paradox as “red herring” sequences. Among other things, these (red herring) sequences have the potential to be used in encryption. If various people or agents studying the sequence have varying computational resources (e.g., different lag lengths in the Markov models they can consider), a variant of the sequence can be constructed in such a way as to guide some sub-population (perhaps those from whom we wish to conceal some data) to believe in the presence or absence
of a particular pattern while guiding a different sub-population (perhaps those to whom we wish to divulge the data) to be aware of the presence (or absence) of some (particular) pattern.

Finally, let us return to the notes at the start of this subsection about how one (apparently) needs the Halting problem (or Entscheidungsproblem) [Turing, 1936] to resolve the elusive model paradox [Dowe, 2008a, footnote 211; 2008b, p. 455]. The Halting problem is something which people do not normally encounter before their undergraduate university years. I put it to the reader that the elusive model paradox is something from which we can deduce the Halting problem yet which should be accessible to school students.
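By way of a concrete (and merely illustrative) sketch of Method 2, the following builds a short “red herring” sequence by Bayesian model averaging over binary Markov models of order 0 up to some cap, always emitting the less probable next bit. The Krichevsky-Trofimov (KT) estimator, the order cap and the weighting of models by their marginal likelihoods are my own choices for the sketch, not specifications from the text.

    import math
    from collections import defaultdict

    def kt_predict(bits, order):
        """Sequential KT predictor for a binary Markov model of the given
        order; returns (P(next bit = 1), log marginal likelihood)."""
        counts = defaultdict(lambda: [0, 0])     # context -> [#0s, #1s]
        log_ml = 0.0
        for i, b in enumerate(bits):
            if i < order:                        # not enough history yet
                continue
            ctx = tuple(bits[i - order:i])
            c0, c1 = counts[ctx]
            p1 = (c1 + 0.5) / (c0 + c1 + 1.0)    # KT estimate
            log_ml += math.log(p1 if b == 1 else 1.0 - p1)
            counts[ctx][b] += 1
        if len(bits) < order:
            return 0.5, log_ml
        c0, c1 = counts[tuple(bits[len(bits) - order:])]
        return (c1 + 0.5) / (c0 + c1 + 1.0), log_ml

    def next_red_herring_bit(bits, max_order=4):
        """Average the order-0..max_order predictions, weighting each model
        by its marginal likelihood, then emit the LESS probable bit.
        (Models of different order are scored on slightly different
        numbers of bits -- a simplification acceptable in a sketch.)"""
        preds, logs = zip(*(kt_predict(bits, k) for k in range(max_order + 1)))
        m = max(logs)
        w = [math.exp(l - m) for l in logs]      # posterior-like weights
        p1 = sum(wi * pi for wi, pi in zip(w, preds)) / sum(w)
        return 0 if p1 >= 0.5 else 1

    bits = [0, 1]                                # arbitrary seed
    for _ in range(60):
        bits.append(next_red_herring_bit(bits))
    print("".join(map(str, bits)))

By construction, any fixed finite-order Markov model that starts to predict this sequence well is promptly contradicted — the “destroy structure upon detection” behaviour described above.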
7.6 Some of many other issues which MML can address
• In numerical integration (or numerical quadrature), we see a variety of approaches such as the rectangular rule, the trapezoidal rule, Simpson’s rule, etc. Of course, where the function we are trying to integrate is not generated from a polynomial, and especially when it is generated from a noisy process, it will typically be better to use MML or a related method to guide the fitting process rather than use arbitrarily complicated polynomials and suffer from the over-fitting problems that come with Maximum Likelihood and similar methods (a toy numerical sketch of this over-fitting point appears at the end of this list);

• generalized hybrid Bayesian network graphical models [Dowe and Wallace, 1998; Comley and Dowe, 2003; Tan and Dowe, 2004, sec. 5; Comley and Dowe, 2005] deal with the issue of “discriminative vs generative” studied by Jebara [2003] and others (e.g., [Long and Servedio, 2006]). Many authors have claimed that discriminative learning can often outperform generative learning. However, if one follows the ideas in [Dowe and Wallace, 1998; Comley and Dowe, 2003; Tan and Dowe, 2004, sec. 5; Comley and Dowe, 2005] and carefully uses MML — recalling the discussion of poor coding schemes in sec. 6.7, taking care with the coding scheme — to construct one’s generalized hybrid Bayesian network graphical model (of which inference of a logic program via Inductive Logic Programming [ILP] [Dowe et al., to appear (a)] is one possible outcome, as can be SVMs from sec. 6.6 or also, e.g., the sort of model from [Oliver and Dowe, 1995]), then the statistical consistency results of MML from [Dowe, 2008a, sec. 0.2.5] and discussed in sec. 6 should guarantee that “generative” learning (when done like this) works fine. Some properties of these generalized hybrid Bayesian network graphical models (which can include both continuous and discrete variables [Comley and Dowe, 2003; 2005]) are discussed in secs. 2.3 (where it is mentioned in passing that entropy can be defined on such hybrid structures) and 3.6 (where we mention that there is no difficulty in defining Kullback-Leibler divergence over such structures);

• following this point, where an unnormalised database is sufficiently large,
then MML inference will lead to database normalisation [Dowe, 2008a, sec. 0.2.6, footnote 187; 2008b, pp. 454–455; Dowe and Zaidi, 2010], often resulting in several tables, conveying a generalised Bayesian net. We can adjust our priors to require this process to be noise-free;

• experimental design [Dowe, 2008a, sec. 0.2.7, p. 544; 2008b, pp. 445–446];

• MML can be applied to statistical hypothesis testing [Dowe, 2008a, sec. 0.2.5, p. 539 and sec. 0.2.2, p. 528, col. 1; 2008b, p. 433 (Abstract), p. 435, p. 445 and pp. 455–456; Musgrave and Dowe, 2010], as can also MDL [Rissanen, 1999a, sec. 3]. (Perhaps see also [Dowe, 2008a, sec. 1].) Daniel F. Schmidt and Enes Makalic have recently presented work in front of an audience including me showing their desire to take this further. As per sec. 7.1, I harbour some concerns about associating the Maximum Likelihood estimate with Normalised Maximum Likelihood;

• association rules (from “data mining” and machine learning) can be incorporated within a generalised Bayesian network structure;

• re-visiting A. Elo’s Elo system and M. Glickman’s Glicko system for chess ratings. Whether for chess players with the advantage of the white pieces and the first move or for sports teams with a home ground advantage, MML can both select the relevant model and do the parameter estimation. Of interest would be a Neyman-Scott-like situation in which many games are being played but there are relatively few games per player. Of similar interest would be a situation with several groups of players where there are many games played within each of the groups but very few games played between members of different groups;

• directional angular data, such as the von Mises circular distribution [Wallace and Dowe, 1993; 1994a; Dowe et al., 1995a; 1995b] and the von Mises-Fisher spherical distribution [Dowe et al., 1996e; 1996f];

• inference of megalithic stone circle (or non-circle) geometries [Patrick and Wallace, 1977; Patrick, 1978; Patrick and Wallace, 1982];

• polynomial regression [Wallace, 1997; Wallace, 1998c; Viswanathan and Wallace, 1999; Rumantir and Wallace, 2001; Fitzgibbon et al., 2002a; Rumantir and Wallace, 2003] (and perhaps also [Schmidt and Makalic, 2009c]);

• inference of MML neural nets [Makalic et al., 2003];

• inference of MML decision trees (or classification trees) and decision graphs (or classification graphs) [Oliver and Wallace, 1991; Oliver et al., 1992; Oliver and Wallace, 1992; Oliver, 1993; Uther and Veloso, 2000; Tan and Dowe, 2002; Tan and Dowe, 2003; Tan and Dowe, 2004], including (as per sec. 6.6) decision trees with support vector machines (SVMs) in their leaves [Tan and
Dowe, 2004] — with applications of MML decision trees and graphs in a variety of areas including (e.g.) protein folding [Dowe et al., 1992; 1992a; 1993] and medical diagnosis [McKenzie et al., 1993];

• MML clustering, mixture modelling and intrinsic classification via the Snob program [Wallace and Boulton, 1968; Wallace, 1984b; 1986; 1990b; 1990c; Wallace and Dowe, 1994b; 1996; 1997a; 1997b; 2000] was originally for the multinomial and Gaussian distributions, but this was extended to also include the Poisson and von Mises circular distributions [Wallace and Dowe, 1994b; 1996; 1997a; 1997b; 2000] — with applications in a variety of areas including (e.g.) spectral modelling [Papp et al., 1993], protein folding [Zakis et al., 1994], and psychology and psychiatry [Kissane et al., 1994; 1996; 1996a; Prior et al., 1998]. Also of interest is determining whether our data appears to contain one line segment or a mixture of more than one line segment [Georgeff and Wallace, 1984a; 1984b; 1985] (and much later work on engineering bridge deterioration using a mixture of a Poisson distribution and a uniform distribution with total assignment [Maheswaran et al., 2006]). (After the MML mixture modelling of multinomial, Gaussian, Poisson and von Mises circular distributions from 1994 [Wallace and Dowe, 1994b; 1996] came a slightly different paper doing only MML Gaussian mixture modelling [Oliver et al., 1996] but emphasising the success of MML in empirical comparisons.) The MML single linear factor analysis from [Wallace and Freeman, 1992] was incorporated into [Edwards and Dowe, 1998] — although [Edwards and Dowe, 1998] did total (rather than partial) assignment and only did single rather than multiple [Wallace, 1995a; 1998b] factor analysis. This has also been extended to a variety of forms of sequential clustering [Edgoose and Allison, 1999; Molloy et al., 2006], with an extension of [Edgoose and Allison, 1999] being (as in the next item) to MML spatial clustering. See also [Boulton and Wallace, 1973b; Dowe, 2008a, sec. 0.2.3, p. 531, col. 1 and sec. 0.2.4, p. 537, col. 2] for a discussion of MML hierarchical clustering. As well as the abovementioned multinomial, Gaussian, Poisson and von Mises circular distributions [Wallace and Dowe, 1994b; 1996; 1997a; 1997b; 2000], this work — without sequential and spatial clustering (following in the next item) — has been extended to other distributions [Agusta and Dowe, 2002a; 2002b; 2003a; 2003b; Bouguila and Ziou, 2007];

• extensions of MML spatial clustering [Wallace, 1998a; Visser and Dowe, 2007] to tomographic [Visser et al., 2009a] and climate [Visser et al., 2009b] models. Variations on this work (and possibly other MML image analysis work [Torsello and Dowe, 2008a; 2008b]) should lend themselves both to training a robot to learn a model for and then recognise a particular class of object (such as a coloured shape, like a particular type of fruit) for robotic hand-eye co-ordination and also to analysing data in constructing a bionic eye;

• inference of systems of one or more probabilistic/stochastic (partial or)
ordinary (difference or) differential equations (plus at least one noise term) from (presumably noisy) data (as per wishes from [Dowe, 2008a, sec. 0.2.7, p. 545]). Uses for this should include the likes of (e.g.) inferring parameter settings for ant colonies and other swarms so as to model them and/or suggesting settings giving better “intelligence” (as per sec. 7.3), and medical applications such as (e.g.) cardiac modelling or modelling stem cells;

• MML, particle physics and the analysis of the data in the search for the Higgs boson [Dowe, 2008a, sec. 0.2.7, p. 544, col. 2];

• etc.

Also of interest here might be

• the relationship between MML and the likelihood principle of statistics [Wallace, 2005, sec. 5.8; Wallace and Dowe, 1999b, sec. 2.3.5], for which MML’s violation is “innocent enough — a misdemeanour rather than a crime” [Wallace, 2005, sec. 5.8, p. 254; Dowe, 2008a, sec. 0.2.4, p. 535, col. 2];

• the relationship between MML and Ed Jaynes’s notion of maximum entropy (or MaxEnt) priors [Jaynes, 2003; Wallace, 2005, secs. 1.15.5 and 2.1.11; Dowe, 2008a, sec. 0.2.4, p. 535, col. 1]. MML and MaxEnt are different but, while still on the topic of entropy and MML, it turns out that — within the MML mixture modelling literature — the term used to shorten the message length when going from (the inefficient coding scheme of) total assignment to (the efficient coding scheme of) partial assignment equates to the entropy of the posterior probability distribution of the class assignment probabilities [Visser et al., 2009b; Wallace, 1998a; Visser and Dowe, 2007];

• etc.
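As promised in the first item of the list above, here is a toy sketch (my own illustration, with an admittedly crude fixed bits-per-parameter cost rather than a proper MML87 construction) of why a two-part message length curbs the over-fitting that Maximum Likelihood invites when fitting polynomials to noisy data.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 40)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, x.size)   # true curve: degree 1

    def two_part_length(degree, bits_per_param=8.0):
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        sigma = max(resid.std(), 1e-6)
        # second part: data given the model (Gaussian noise), in bits
        data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma ** 2)
        # first part: a crude fixed cost per stated parameter (coeffs + sigma)
        model_bits = bits_per_param * (degree + 2)
        return model_bits + data_bits

    for d in range(6):
        print(d, round(two_part_length(d), 1))

The data-given-model term alone keeps shrinking as the degree grows (this is Maximum Likelihood’s preference), but the total two-part length typically bottoms out at the true degree.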
7.7 MML and other philosophical issues
When time next permits, here are some of the many other philosophical issues to which MML pertains:

• entropy is not time’s arrow [Wallace, 2005, chap. 8 (and p. vii); Dowe, 2008a, sec. 0.2.5; 2008b, p. 455], and note the ability of MML to detect thermal fluctuations and not over-fit them where some other statistical methods might be tempted to regard the standard noisy fluctuations as being something more [Wallace, 2005, chap. 8]. (One wonders whether the formation of these ideas might be evident in [Wallace, 1973a].) Recalling sec. 4.1 on inference (or explanation) and prediction, one interesting thing here is Wallace’s take on why it is that we wish to predict the future but (only) infer (or explain) the past [Wallace, 2005, chap. 8];
• being able to accord something the title of a “miracle” [Wallace, 2005, sec. 1.2, p. 7; Dowe, 2008a, sec. 0.2.7, p. 545, col. 1; 2008b, p. 455], the fine tuning argument in intelligent design [Dowe et al., to appear (b)] (possibly see also sec. 7.3) and evidence that there is an intelligent supervisor/shepherd listening to our prayers and overseeing — and sometimes intervening in — our lives. Just as we can use MML to decide whether or not to accord something the title of a miracle, so, too, we can set about being objective and using MML to quantify the probability of certain coincidences and whether or not there could be an intelligent supervisor/shepherd intervening in our lives. Of course, such a shepherd might be able to make discreet minor changes, effectively impossible to notice, in one place which lead to substantial changes in other places. (I am not offering my opinion one way or another here, but rather merely raising how this issue might be addressed in an MML framework);

• Efficient Markets [Dowe and Korb, 1996; Dowe, 2008a, sec. 0.2.5; 2008b, p. 455] — due to the Halting problem, MML and Kolmogorov complexity essentially say (in short) that financial markets are very unlikely to be efficient and next to impossible to prove efficient. Attempts to make this point more accessible — by showing the effects of having a variety of trading approaches equal in all ways but one, where one trader is better in terms of (e.g.) inference method, speed or memory — are given in [Collie et al., 2005; 2005a];

• redundant Turing Machines (unlike those of sec. 2.4), for which pre- and post-processing can be used to effectively emulate a (Universal) Turing Machine by non-conventional means [Dowe, 2008a, sec. 0.2.7, p. 544];

• undecidability in (optimal) engineering tuning and design (ultimately due to the Halting problem);

• probabilities of conditionals and conditional probabilities [Dowe, 2008a, sec. 0.2.7, p. 546];

• information and MML re the originality of an idea, the degree of creativity of an act or design — or humour [Dowe, 2008a, sec. 0.2.7, p. 545] (the reader is welcome to inspect not unrelated ideas in [Solomonoff, 1995; Schmidhuber, 2007] in order to determine the originality of this idea). Puns typically entail finding commonality between at least two different subject matters. The finding of such commonality is crucial to the creation of the pun, and the recognising of such commonality is crucial to the understanding of the pun. A similar comment applies to the creation and solving of (cryptic) clues from a (cryptic) crossword. This said, in my experience, creating puns seems to be more difficult than understanding them — whereas solving cryptic crossword clues seems to be more difficult than creating them;
• mnemonics (or memory aids), whether or not this should be regarded as a philosophical issue. Certain mnemonics are compressions or contractions from which we can re-construct the “data” that we ultimately wish to recall. However, there is a seemingly slightly curious phenomenon here. People might recall (e.g.) the periodic table of elements or (e.g.) the base 10 decimal expansion of π by recalling a mnemonic sequence of words which tells a story. In the case of the periodic table, these words (can) start with the first one or so letters of the chemical elements in sequence. In the case of π, these words in sequence (can) have lengths corresponding to the relevant digit: so, the length of the i-th word is the i-th digit of π — e.g., “How I want a drink, alcoholic of course, after the heavy chapters involving quantum mechanics” (for 3.14159265358979; a one-line decoding sketch appears at the end of this subsection). These little stories are fairly easily remembered — one might say that they are compressible, so one can re-construct them fairly easily, from where one can go on to re-construct what one was really trying to recall. However, the slight curiosity is that, for all the easy compressible niceties of the mnemonic sequence, it is actually longer than the original. Perhaps the resolution is that whatever slight redundancies there are in the mnemonics serve as error corrections. So, perhaps such mnemonics are quite compressible in their own right so that they can easily be re-constructed, but have sufficient redundancy to reduce errors. I think there is room for further discussion on this topic;

• fictionalism is an area of philosophy which (according to my understanding of it) is concerned with the sense in which we can talk about some fictional character (e.g., Elizabeth Bennet from “Pride and Prejudice”) as though they were real — and then go on to discuss how said character(s) might behave in some scenario not presented in the story in which said character(s) appear(s). This seems very much to be an MML matter. We form a model of the character(s) based on what we know about the(se) character(s). We have a model of how various types of real-world character behave in certain scenarios, and we go from there. In similar vein, MML has much to say about the philosophical notion of counterfactuals and possible worlds, although here there is a minor issue of (recall sec. 4.1) whether we are interested in inference as to how things would most probably be in the nearest possible world or instead a weighted prediction of how things might be — obtained by doing a weighted combination of predictions over a variety of possible worlds;

• etc.

Having made the above list, I now mention some issues to which I would like to be able to apply MML.

• virtue — Confucius (the Chinese philosopher) [Confucius, 1938] and (about 500 years later) Jesus (from whom we have the Christian religion) are well-known for their comments on consideration, and Confucius further for his
other comments on virtue. In game theory (e.g., the prisoners’ dilemma), perhaps virtue is being/doing as all would need to do to give the best all-round solution (in similar vein to the merits of co-operation described in, e.g., [Wallace, 1998d]), or perhaps it is doing the optimum by the others on the presumption that the others will all act out of self-interest. Perhaps MML can offer some insight here; and

• the Peter principle — I hear of and sometimes see far too many shocking appointments of the undeserving, of a candidate who “peaked in the interview”. Perhaps in terms of some sort of experimental design, or of more properly knowing how to analyse the available data on candidates, it would be nice to put MML to use here.
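As flagged in the mnemonics item above, the word-length encoding of π can be undone in essentially one line (my own toy illustration; note that in the standard convention a ten-letter word would stand for the digit 0, which this tiny decoder ignores):

    mnemonic = ("How I want a drink, alcoholic of course, after the heavy "
                "chapters involving quantum mechanics")
    # each word's letter count (punctuation stripped) is one digit of pi
    digits = "".join(str(len(word.strip(",."))) for word in mnemonic.split())
    print(digits)   # -> 314159265358979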
8 ACKNOWLEDGEMENTS
I first thank Prasanta Bandyopadhyay, Malcolm Forster and John Woods for permitting me to contribute this piece. I thank the anonymous referee — whose helpful feedback clearly indicated that said referee had read my submission closely. I especially thank Prasanta for his regular patient editorial monitoring of my progress (or lack thereof), at least comparable in magnitude to that of an additional referee. I next thank my mother and my family [Comley and Dowe, 2005, p. 287, sec. 11.6]. I now quote Confucius [1938, Book I, saying 15]: “Tzu-kung said, ‘Poor without cadging, rich without swagger.’ What of that? The Master said, Not bad. But better still, ‘Poor, yet delighting in the Way; rich, yet a student of ritual.’ ” I thank Chris Wallace [Gupta et al., 2004; Dowe, 2008a, footnote 218] (for a variety of matters — such as, e.g., the letter referred to in [Dowe, 2008a, footnote 218] and all the training he gave me over the years), for whose Christopher Stewart WALLACE (1933-2004) memorial special issue of the Computer Journal [Dowe, 2008a; Brennan, 2008; Solomonoff, 2008; Jorgensen and McLachlan, 2008; Brent, 2008; Colon-Bonet and Winterrowd, 2008; Castro et al., 2008] it was an honour to be guest editor. (For other mention and examples of the range and importance of his work, see, e.g., [Parry, 2005] and [Clark and Wallace, 1970].) I talk there of his deserving at least one Turing Award [Dowe, 2008a, sec. 0.2.2, p. 526 and sec. 0.2.4, p. 533, col. 2] and of how his work on entropy not being time’s arrow might have stood him in contention for the Nobel prize in Physics if it could be experimentally tested [Dowe, 2008a, sec. 0.2.5, footnote 144]. (And, possibly, as per [Dowe, 2008a, sec. 0.2.7, p. 544, col. 2] and sec. 7.6, MML will have a role to play in the discovery of the Higgs boson.) I forgot to point out that, if he’d lived a few decades longer, the increasing inclusion of his work in econometrics (see, e.g., [Fitzgibbon et al., 2004; Dowe, 2008a, sec. 0.2.3, footnote 88] and sec. 6.5 — and possibly also these papers on segmentation and cut-points [Oliver et al., 1998; Viswanathan et al., 1999; Fitzgibbon et al., 2002b] — for teasers) might have one day earned him the Nobel prize in Economics. (Meanwhile, it would be good to re-visit both ARCH [Autoregressive Conditional Heteroskedasticity] and
GARCH [Generalised ARCH] using MML.) For a list of his publications (MML and other), see either the reference list to [Dowe, 2008a] (which also lists the theses he supervised) or www.csse.monash.edu.au/~dld/CSWallacePublications. And, finally, I thank Fran Boyce [Dowe, 2008a, footnote 217; Obituaries, 2009], one of the nicest, gentlest, most thoughtful and most gracious people one could meet or ever even hope to meet.
BIBLIOGRAPHY

[Obituaries, 2009] Obituaries: Drive to learn despite struggle with lupus. The Herald Sun, page 79, Thu. 6 Aug., 2009. Melbourne, Australia; Author: M. Sonogan. [Agusta, 2005] Y. Agusta. Minimum Message Length Mixture Modelling for Uncorrelated and Correlated Continuous Data Applied to Mutual Funds Classification. PhD thesis, School of Computer Science and Software Engineering, Clayton School of I.T., Monash University, Clayton, Australia, 2005. [Agusta and Dowe, 2002b] Y. Agusta and D. L. Dowe. Clustering of Gaussian and t distributions using minimum message length. In Proc. International Conference on Knowledge Based Computer Systems (KBCS 2002), pages 289–299. Vikas Publishing House, 2002. http://www.ncst.ernet.in/kbcs2002. [Agusta and Dowe, 2002a] Y. Agusta and D. L. Dowe. MML clustering of continuous-valued data using Gaussian and t distributions. In B. McKay and J. Slaney, editors, Lecture Notes in Artificial Intelligence (LNAI), Proc. Australian Joint Conference on Artificial Intelligence, volume 2557, pages 143–154, Berlin, Germany, December 2002. Springer-Verlag. [Agusta and Dowe, 2003b] Y. Agusta and D. L. Dowe. Unsupervised learning of correlated multivariate Gaussian mixture models using MML. In Lecture Notes in Artificial Intelligence (LNAI) 2903 (Springer), Proc. 16th Australian Joint Conf. on Artificial Intelligence, pages 477–489, 2003. [Agusta and Dowe, 2003a] Y. Agusta and D. L. Dowe. Unsupervised learning of Gamma mixture models using minimum message length. In M. H. Hamza, editor, Proceedings of the 3rd IASTED conference on Artificial Intelligence and Applications, pages 457–462, Benalmadena, Spain, September 2003. ACTA Press. [Allison and Wallace, 1993] L. Allison and C. S. Wallace. The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimisation of multiple alignments. Technical report CS 93/188, Dept Computer Science, Monash University, Melbourne, Australia, 1993. [Allison and Wallace, 1994a] L. Allison and C. S. Wallace. An information measure for the string to string correction problem with applications. 17th Australian Comp. Sci. Conf., pages 659–668, January 1994. Australian Comp. Sci. Comm. Vol 16 No 1(C) 1994. [Allison and Wallace, 1994b] L. Allison and C. S. Wallace. The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments. J. Mol. Evol., 39(4):418–430, October 1994. An early version: TR 93/188, Dept. Computer Science, Monash University, July 1993. [Allison et al., 1990] L. Allison, C. S. Wallace, and C. N. Yee. Inductive inference over macro-molecules. Technical Report 90/148, Monash University, Clayton, Victoria, Australia, 3168, 1990. [Allison et al., 1990b] L. Allison, C. S. Wallace, and C. N. Yee. Inductive inference over macromolecules. In Working Notes AAAI Spring Symposium Series, pages 50–54. Stanford Uni., Calif., U.S.A., 1990. [Allison et al., 1990a] L. Allison, C. S. Wallace, and C. N. Yee. When is a string like a string? In International Symposium on Artificial Intelligence and Mathematics, January 1990. [Allison et al., 1991] L. Allison, C. S. Wallace, and C. N. Yee. Minimum message length encoding, evolutionary trees and multiple-alignment. Technical report CS 91/155, Dept Computer Science, Monash University, Melbourne, Australia, 1991.
[Allison et al., 1992b] L. Allison, C. S. Wallace, and C. N. Yee. Finite-state models in the alignment of macro-molecules. J. Mol. Evol., 35(1):77–89, July 1992. Extended abstract titled: Inductive inference over macro-molecules in joint sessions at AAAI Symposium, Stanford, Mar 1990 on (i) Artificial Intelligence and Molecular Biology, pp. 5–9 & (ii) Theory and Application of Minimal-Length Encoding, pp. 50–54. [Allison et al., 1992a] L. Allison, C. S. Wallace, and C. N. Yee. Minimum message length encoding, evolutionary trees and multiple alignment. 25th Hawaii Int. Conf. on Sys. Sci., 1:663–674, January 1992. Another version is given in TR 91/155, Dept. Computer Science, Monash University, Clayton, Vic, Australia, 1991. [Barron and Cover, 1991] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034–1054, 1991. [Baxter and Oliver, 1995] R. A. Baxter and J. J. Oliver. MDL and MML: Similarities and differences. Technical report TR 94/207, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, 1995. [Bernardo and Smith, 1994] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley, New York, 1994. [Bouguila and Ziou, 2007] N. Bouguila and D. Ziou. High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1716–1731, October 2007. [Boulton, 1970] D. M. Boulton. Numerical classification based on an information measure. M.Sc. thesis, Basser Computing Dept., University of Sydney, Sydney, Australia, 1970. [Boulton, 1975] D. M. Boulton. The Information Measure Criterion for Intrinsic Classification. PhD thesis, Dept. Computer Science, Monash University, Clayton, Australia, August 1975. [Boulton and Wallace, 1969] D. M. Boulton and C. S. Wallace. The information content of a multistate distribution. J. Theor. Biol., 23:269–278, 1969. [Boulton and Wallace, 1970] D. M. Boulton and C. S. Wallace. A program for numerical classification. Computer Journal, 13(1):63–69, February 1970. [Boulton and Wallace, 1973c] D. M. Boulton and C. S. Wallace. A comparison between information measure classification. In Proc. of the Australian & New Zealand Association for the Advancement of Science (ANZAAS) Congress, August 1973. Abstract. [Boulton and Wallace, 1973b] D. M. Boulton and C. S. Wallace. An information measure for hierarchic classification. Computer Journal, 16(3):254–261, 1973. [Boulton and Wallace, 1973a] D. M. Boulton and C. S. Wallace. Occupancy of a rectangular array. Computer Journal, 16(1):57–63, 1973. [Boulton and Wallace, 1975] D. M. Boulton and C. S. Wallace. An information measure for single link classification. Computer Journal, 18(3):236–238, 1975. [Brennan, 2008] M. H. Brennan. Data processing in the early cosmic ray experiments in Sydney. Computer Journal, 51(5):561–565, September 2008. [Brennan et al., 1958] M. H. Brennan, D. D. Millar, and C. S. Wallace. Air showers of size greater than 10^5 particles - (1) core location and shower size determination. Nature, 182:905–911, Oct. 4 1958. [Brent, 2008] R. P. Brent. Some comments on C. S. Wallace’s random number generators. Computer Journal, 51(5):579–584, September 2008. [Castro et al., 2008] M. D. Castro, R. D. Pose, and C. Kopp. Password-capabilities and the Walnut kernel. Computer Journal, 51(5):595–607, September 2008. [Chaitin, 1966] G. J. Chaitin.
On the length of programs for computing finite sequences. Journal of the Association for Computing Machinery, 13:547–569, 1966. [Chaitin, 2005] G. J. Chaitin. Meta Math! The Quest for Omega. Pantheon, 2005. ISBN 0-375-42313-3 (978-0-375-42313-0). [Clark and Wallace, 1970] G. M. Clark and C. S. Wallace. Analysis of nasal support. Archives of Otolaryngology, 92:118–129, August 1970. [Clarke, 1999] B. Clarke. Discussion of the papers by Rissanen, and by Wallace and Dowe. Computer J., 42(4):338–339, 1999. [Collie et al., 2005] M. J. Collie, D. L. Dowe, and L. J. Fitzgibbon. Stock market simulation and inference technique. In Fifth International Conference on Hybrid Intelligent Systems (HIS’05), Rio de Janeiro, Brazil, Nov 2005.
[Collie et al., 2005a] M. J. Collie, D. L. Dowe, and L. J. Fitzgibbon. Trading rule search with autoregressive inference agents. Technical report CS 2005/174, School of Computer Science and Software Engineering, Monash University, Melbourne, Australia, 2005. [Colon-Bonet and Winterrowd, 2008] G. Colon-Bonet and P. Winterrowd. Multiplier evolution — a family of multiplier VLSI implementations. Computer Journal, 51(5):585–594, September 2008. [Comley and Dowe, 2003] Joshua W. Comley and David L. Dowe. General Bayesian networks and asymmetric languages. In Proc. Hawaii International Conference on Statistics and Related Fields, 5-8 June 2003. [Comley and Dowe, 2005] Joshua W. Comley and David L. Dowe. Minimum message length and generalized Bayesian nets with asymmetric languages. In P. Grünwald, M. A. Pitt, and I. J. Myung, editors, Advances in Minimum Description Length: Theory and Applications (MDL Handbook), pages 265–294. M.I.T. Press, April 2005. Chapter 11, ISBN 0-262-07262-9. Final camera-ready copy submitted in October 2003. [Originally submitted with title: “Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric Languages”.] [Confucius, 1938] Confucius. The Analects of Confucius (Lun Yü). Vintage Books, 1989. (Published earlier with Macmillan in 1938.) Translated by Arthur Waley. Online at http://myweb.cableone.net/subru/Confucianism.html. [Dai et al., 1996a] H. Dai, K. B. Korb, and C. S. Wallace. The discovery of causal models with small samples. In 1996 Australian New Zealand Conference on Intelligent Information Systems Proceedings ANZIIS96, pages 27–30. IEEE, Piscataway, NJ, USA, 1996. [Dai et al., 1996b] H. Dai, K. B. Korb, and C. S. Wallace. A study of causal discovery with weak links and small samples. Technical report CS 96/298, Dept Computer Science, Monash University, Melbourne, Australia, 1996. [Dai et al., 1997b] H. Dai, K. B. Korb, C. S. Wallace, and X. Wu. A study of causal discovery with weak links and small samples. Technical report SD TR97-5, Dept Computer Science, Monash University, Melbourne, Australia, 1997. [Dai et al., 1997a] Honghua Dai, Kevin B. Korb, C. S. Wallace, and Xindong Wu. A study of causal discovery with weak links and small samples. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, IJCAI 97, pages 1304–1309, 1997. [Dawid, 1999] A. P. Dawid. Discussion of the papers by Rissanen and by Wallace and Dowe. Computer J., 42(4):323–326, 1999. [Deakin, 2001] M. A. B. Deakin. The characterisation of scoring functions. J. Australian Mathematical Society, 71:135–147, 2001. [Dowe, 2007] D. L. Dowe. Discussion following “Hedging predictions in machine learning, A. Gammerman and V. Vovk”. Computer Journal, 2(50):167–168, 2007. [Dowe, 2008a] D. L. Dowe. Foreword re C. S. Wallace. Computer Journal, Christopher Stewart WALLACE (1933–2004) memorial special issue, 51(5):523–560, September 2008. [Dowe, 2008b] D. L. Dowe. Minimum Message Length and statistically consistent invariant (objective?) Bayesian probabilistic inference — from (medical) “evidence”. Social Epistemology, 22(4):433–460, October–December 2008. [Dowe et al., 1995] D. L. Dowe, L. Allison, T. I. Dix, L. Hunter, C. S. Wallace, and T. Edgoose. Circular clustering by minimum message length of protein dihedral angles. Technical report CS 95/237, Dept Computer Science, Monash University, Melbourne, Australia, 1995. [Dowe et al., 1996] D. L. Dowe, L. Allison, T. I. Dix, L. Hunter, C. S. Wallace, and T. Edgoose.
Circular clustering of protein dihedral angles by minimum message length. In Pacific Symposium on Biocomputing ’96, pages 242–255. World Scientific, January 1996. [Dowe et al., 1998] D. L. Dowe, R. A. Baxter, J. J. Oliver, and C. S. Wallace. Point estimation using the Kullback-Leibler loss function and MML. In X. Wu, Ramamohanarao Kotagiri, and K. Korb, editors, Proceedings of the 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), volume 1394 of LNAI, pages 87–95, Berlin, April 15–17 1998. Springer. [Dowe et al., 1996a] D. L. Dowe, G. E. Farr, A. J. Hurst, and K. L. Lentin. Information-theoretic football tipping. 3rd Conf. on Maths and Computers in Sport, pages 233–241, 1996. See also Technical Report TR 96/297, Dept. Computer Science, Monash University, Australia 3168, Dec 1996. [Dowe et al., 1996b] D. L. Dowe, G. E. Farr, A. J. Hurst, and K. L. Lentin. Information-theoretic football tipping. Technical report TR 96/297, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, 1996.
[Dowe et al., to appear (a)] D. L. Dowe, C. Ferri, J. Hernández-Orallo, and M. José Ramírez-Quintana. An MML Scheme for ILP. Submitted. [Dowe et al., 2007] D. L. Dowe, S. Gardner, and G. R. Oppy. Bayes not bust! Why simplicity is no problem for Bayesians. British Journal for the Philosophy of Science, 58(4):709–754, December 2007. [Dowe et al., to appear (b)] D. L. Dowe, S. Gardner, and G. R. Oppy. MML and the fine tuning argument. To appear, 2011. [Dowe and Hajek, 1997] D. L. Dowe and A. R. Hajek. A computational extension to the Turing test. Technical Report 97/322, Dept. Computer Science, Monash University, Australia 3168, October 1997. [Dowe and Hajek, 1998] D. L. Dowe and A. R. Hajek. A non-behavioural, computational extension to the Turing test. In Proceedings of the International Conference on Computational Intelligence & Multimedia Applications (ICCIMA’98), pages 101–106, Gippsland, Australia, February 1998. [Dowe et al., 1996c] D. L. Dowe, A. J. Hurst, K. L. Lentin, G. Farr, and J. J. Oliver. Probabilistic and Gaussian football prediction competitions - Monash. Artificial Intelligence in Australia Research Report, June 1996. [Dowe et al., 1998a] D. L. Dowe, M. Jorgensen, G. McLachlan, and C. S. Wallace. Information-theoretic estimation. In W. Robb, editor, Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), page 125, Queensland, Australia, July 1998. [Dowe and Korb, 1996] D. L. Dowe and K. B. Korb. Conceptual difficulties with the efficient market hypothesis: Towards a naturalized economics. In Proc. Information, Statistics and Induction in Science (ISIS), pages 212–223, 1996. See also Technical Report TR 94/215, Dept. Computer Science, Monash University, Australia 3168, 1994. [Dowe and Krusel, 1993] D. L. Dowe and N. Krusel. A decision tree model of bushfire activity. Technical report TR 93/190, Dept. of Computer Science, Monash University, Clayton, Vic. 3800, Australia, September 1993. [Dowe and Lentin, 1995] D. L. Dowe and K. L. Lentin. Information-theoretic footy-tipping competition — Monash. Computer Science Association Newsletter (Australia), pages 55–57, December 1995. [Dowe et al., 1996d] D. L. Dowe, K. L. Lentin, J. J. Oliver, and A. J. Hurst. An information-theoretic and a Gaussian footy-tipping competition. FCIT Faculty Newsletter, Monash University, Australia, pages 2–6, June 1996. [Dowe et al., 1992] D. L. Dowe, J. J. Oliver, L. Allison, T. I. Dix, and C. S. Wallace. Learning rules for protein secondary structure prediction. In C. McDonald, J. Rohl, and R. Owens, editors, Proc. 1992 Department Research Conference. Dept. Computer Science, University of Western Australia, July 1992. [Dowe et al., 1992a] D. L. Dowe, J. J. Oliver, L. Allison, C. S. Wallace, and T. I. Dix. A decision graph explanation of protein secondary structure prediction. Technical report CS 92/163, Dept Computer Science, Monash University, Melbourne, Australia, 1992. [Dowe et al., 1995a] D. L. Dowe, J. J. Oliver, R. A. Baxter, and C. S. Wallace. Bayesian estimation of the von Mises concentration parameter. In Proc. 15th Int. Workshop on Maximum Entropy and Bayesian Methods, Santa Fe, July 1995. [Dowe et al., 1995b] D. L. Dowe, J. J. Oliver, R. A. Baxter, and C. S. Wallace. Bayesian estimation of the von Mises concentration parameter. Technical report CS 95/236, Dept Computer Science, Monash University, Melbourne, Australia, 1995. [Dowe et al., 1993] D. L. Dowe, J. J. Oliver, T. I. Dix, L. Allison, and C. S. Wallace.
A decision graph explanation of protein secondary structure prediction. 26th Hawaii Int. Conf. Sys. Sci., 1:669–678, January 1993. [Dowe et al., 1996e] D. L. Dowe, J. J. Oliver, and C. S. Wallace. MML estimation of the parameters of the spherical Fisher distribution. In Algorithmic Learning Theory, 7th International Workshop, ALT ’96, Sydney, Australia, October 1996, Proceedings, volume 1160 of Lecture Notes in Artificial Intelligence, pages 213–227. Springer, October 1996. [Dowe et al., 1996f] D. L. Dowe, J. J. Oliver, and C. S. Wallace. MML estimation of the parameters of the spherical Fisher distribution. Technical report CS 96/272, Dept Computer Science, Monash University, Melbourne, Australia, 1996. [Dowe and Oppy, 2001] D. L. Dowe and G. R. Oppy. Universal Bayesian inference? Behavioral and Brain Sciences (BBS), 24(4):662–663, Aug 2001.
[Dowe and Wallace, 1996] D. L. Dowe and C. S. Wallace. Resolving the Neyman-Scott problem by minimum message length (abstract). In Proc. Sydney Int. Stat. Congress, pages 197–198, 1996. [Dowe and Wallace, 1997a] D. L. Dowe and C. S. Wallace. Resolving the Neyman-Scott problem by Minimum Message Length. In Proc. Computing Science and Statistics — 28th Symposium on the interface, volume 28, pages 614–618, 1997. [Dowe and Wallace, 1997b] D. L. Dowe and C. S. Wallace. Resolving the Neyman-Scott problem by Minimum Message Length. Technical report TR no. 97/307, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, February 1997. Also in Proc. Sydney Int. Stat. Congr. (SISC-96), Sydney, pages 197–198; and in IMS Bulletin (1996), 25(4), pp. 410–411. [Dowe and Wallace, 1998] D. L. Dowe and C. S. Wallace. Kolmogorov complexity, minimum message length and inverse learning. In W. Robb, editor, Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), page 144, Queensland, Australia, July 1998. [Dowe and Zaidi, 2010] D. L. Dowe and N. A. Zaidi. Database normalization as a by-product of minimum message length inference. In Proc. 23rd Australian Joint Conference on Artificial Intelligence (AI’2010), Adelaide, Australia, 7-10 December 2010, pp. 82–91. Springer Lecture Notes in Artificial Intelligence (LNAI), vol. 6464, Springer, 2010. [Edgoose and Allison, 1999] T. Edgoose and L. Allison. MML Markov classification of sequential data. Stats. and Comp., 9(4):269–278, September 1999. [Edgoose et al., 1998] T. Edgoose, L. Allison, and D. L. Dowe. An MML classification of protein structure that knows about angles and sequence. In Pacific Symposium on Biocomputing ’98, pages 585–596. World Scientific, January 1998. [Edwards and Dowe, 1998] R. T. Edwards and D. L. Dowe. Single factor analysis in MML mixture modelling. In Xindong Wu, Ramamohanarao Kotagiri, and Kevin B. Korb, editors, Proceedings of the 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), volume 1394 of Lecture Notes in Artificial Intelligence (LNAI), pages 96–109, Berlin, April 15–17 1998. Springer. [Farr and Wallace, 2002] G. E. Farr and C. S. Wallace. The complexity of strict minimum message length inference. Computer Journal, 45(3):285–292, 2002. [Fitzgibbon, 2004] L. J. Fitzgibbon. Message from Monte Carlo: A Framework for Minimum Message Length Inference using Markov Chain Monte Carlo Methods. PhD thesis, School of Computer Science and Software Engineering, Clayton School of I.T., Monash University, Clayton, Australia, 2004. [Fitzgibbon et al., 2002b] L. J. Fitzgibbon, D. L. Dowe, and Lloyd Allison. Change-point estimation using new minimum message length approximations. In Lecture Notes in Artificial Intelligence (LNAI) 2417, 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI), pages 244–254. Springer, 2002. [Fitzgibbon et al., 2002a] L. J. Fitzgibbon, D. L. Dowe, and Lloyd Allison. Univariate polynomial inference by Monte Carlo message length approximation. In 19th International Conference on Machine Learning (ICML), pages 147–154, 2002. [Fitzgibbon et al., 2004] L. J. Fitzgibbon, D. L. Dowe, and F. Vahid. Minimum message length autoregressive model order selection. In Proc. Int. Conf. on Intelligent Sensors and Information Processing, pages 439–444, Chennai, India, January 2004. [Gammerman and Vovk, 2007a] Alex Gammerman and Vladimir Vovk. Hedging predictions in machine learning.
Computer Journal, 2(50):151–163, 2007. [Gammerman and Vovk, 2007b] Alex Gammerman and Vladimir Vovk. Rejoinder: Hedging predictions in machine learning. Computer Journal, 2(50):173–177, 2007. [Georgeff and Wallace, 1984a] M. P. Georgeff and C. S. Wallace. A general selection criterion for inductive inference. In T. O’Shea, editor, Advances in Artificial Intelligence: Proc. Sixth European Conference on Artificial Intelligence (ECAI-84), pages 473–482, Amsterdam, September 1984. Elsevier Science Publishers B.V. (North Holland). [Georgeff and Wallace, 1984b] M. P. Georgeff and C. S. Wallace. A general selection criterion for inductive inference. Technical report TR 44, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, June 1984. [Georgeff and Wallace, 1985] M. P. Georgeff and C. S. Wallace. Minimum information estimation of structure. In T. O’Shea, editor, Advances in Artificial Intelligence, pages 219–228. Elsevier, 1985.
[Gödel, 1931] K. Gödel. On formally undecidable propositions of Principia mathematica and related systems I. Monatshefte für Mathematik und Physik, 38:173–198, 1931. [Good, 1952] I. J. Good. Rational decisions. J. Roy. Statist. Soc. B, B 14:107–114, 1952. [Grünwald, 2007] P. D. Grünwald. The Minimum Description Length principle (Adaptive Computation and Machine Learning). M.I.T. Press, 2007. [Grünwald et al., 1998] P. D. Grünwald, P. Kontkanen, P. Myllymaki, T. Silander, and H. Tirri. Minimum encoding approaches for predictive modeling. In Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI’98), pages 183–192, 1998. [Grünwald and Langford, 2004] Peter D. Grünwald and John Langford. Suboptimal behavior of Bayes and MDL in classification under misspecification. In COLT, pages 331–347, 2004. [Grünwald and Langford, 2007] Peter D. Grünwald and John Langford. Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning, 66:119–149(31), March 2007. [Gupta et al., 2004] G. K. Gupta, I. Zukerman, and D. W. Albrecht. Obituaries: Australia’s inspiring leader in the computing revolution — Christopher Wallace computer scientist. The Age, page 9, 2004. Melbourne, Australia; near and probably on Fri. 1 Oct. 2004. [Hernández-Orallo, 2000] José Hernández-Orallo. Beyond the Turing test. Journal of Logic, Language and Information, 9(4):447–466, 2000. [Hernández-Orallo and Dowe, 2010] José Hernández-Orallo and D. L. Dowe. Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence Journal, 174(18):1508–1539, 2010. [Hernández-Orallo and Minaya-Collado, 1998] José Hernández-Orallo and N. Minaya-Collado. A formal definition of intelligence based on an intensional variant of Kolmogorov complexity. In Proceedings of the International Symposium of Engineering of Intelligent Systems, ICSC Press, pages 146–163, 1998. [Hodges, 1983] Andrew Hodges. Alan Turing: The Enigma. Simon and Schuster, 1983. [Hope and Korb, 2002] L. R. Hope and K. Korb. Bayesian information reward. In R. McKay and J. Slaney, editors, Proc. 15th Australian Joint Conference on Artificial Intelligence — Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Germany, ISSN: 0302-9743, Vol. 2557, number 2557 in Lecture Notes in Artificial Intelligence (LNAI), pages 272–283. Springer Verlag, 2002. [Huber, 2008] F. Huber. Milne’s argument for the log-ratio measure. Philosophy of Science, pages 413–420, 2008. [Jaynes, 2003] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003. [Jebara, 2003] Tony Jebara. Machine Learning: Discriminative and Generative. Kluwer Academic Publishers, Norwell, MA, U.S.A., 2003. [Jeffreys, 1946] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proc. of the Royal Soc. of London A, 186:453–454, 1946. [Jorgensen and Gentleman, 1998] M. A. Jorgensen and R. Gentleman. Data mining. Chance, 11:34–39, 42, 1998. [Jorgensen and McLachlan, 2008] M. A. Jorgensen and G. J. McLachlan. Wallace’s approach to unsupervised learning: the Snob program. Computer Journal, 51(5):571–578, September 2008. [Kearns et al., 1997] M. Kearns, Y. Mansour, A. Y. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7–50, 1997. [Kissane et al., 1994] D. W. Kissane, S. Bloch, W. I. Burns, J. D. Patrick, C. S. Wallace, and D. P. McKenzie.
Perceptions of family functioning and cancer. Psycho-oncology, 3:259–269, 1994. [Kissane et al., 1996] D. W. Kissane, S. Bloch, D. L. Dowe, R. D. Snyder, P. Onghena, D. P. McKenzie, and C. S. Wallace. The Melbourne family grief study, I: Perceptions of family functioning in bereavement. American Journal of Psychiatry, 153:650–658, May 1996. [Kissane et al., 1996a] D. W. Kissane, S. Bloch, P. Onghena, D. P. McKenzie, R. D. Snyder, and D. L. Dowe. The Melbourne family grief study, II: Psychosocial morbidity and grief in bereaved families. American Journal of Psychiatry, 153:659–666, May 1996. [Kolmogorov, 1965] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:4–7, 1965.
976
David L. Dowe
[Kontoyiannis, 2008] I. Kontoyiannis. Review of “Information and Complexity in Statistical Modeling by Jorma Rissanen”. American Mathematical Monthly — Reviews, 115:956–960, 2008. [Korb and Wallace, 1997] K B Korb and C. S. Wallace. In search of the philosopher’s stone: Remarks on Humphreys and Freedman’s critique of causal discovery. Technical report CS 97/315, Dept Computer Science, Monash University, Melbourne, Australia, 1997. [Korb and Wallace, 1999] K. B. Korb and C. S. Wallace. In search of the philosopher’s stone: Remarks on Humphreys and Freedman’s critique of causal discovery. British Jrnl. for the Philosophy of Science, pages 543–553, 1999. TR 97/315, Mar 1997, Dept. Computer Science, Monash University, Australia 3168. [Kornienko et al., 2005a] L. Kornienko, D. W. Albrecht, and D. L. Dowe. A preliminary MML linear classifier using principal components for multiple classes. In Proc. 18th Australian Joint Conference on Artificial Intelligence (AI’2005), volume 3809 of Lecture Notes in Artificial Intelligence (LNAI), pages 922–926, Sydney, Australia, Dec 2005. Springer. [Kornienko et al., 2005b] L. Kornienko, D. W. Albrecht, and D. L. Dowe. A preliminary MML linear classifier using principal components for multiple classes. Technical report CS 2005/179, School of Computer Sci. & Softw. Eng., Monash Univ., Melb., Australia, 2005. [Kornienko et al., 2002] Lara Kornienko, David L. Dowe, and David W. Albrecht. Message length formulation of support vector machines for binary classification — A preliminary scheme. In Lecture Notes in Artificial Intelligence (LNAI), Proc. 15th Australian Joint Conf. on Artificial Intelligence, volume 2557, pages 119–130. Springer-Verlag, 2002. [Kraft, 1949] L. G. Kraft, 1949. Master’s thesis, Dept. of Elec. Eng., M.I.T., U.S.A. [Lancaster, 2002] A. Lancaster. Orthogonal parameters and panel data. Review of Economic Studies, 69:647–666, 2002. [Legg and Hutter, 2007] S. Legg and M. Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, November 2007. [Lewis, 1976] David K. Lewis. Probabilities of conditionals and conditional probabilities. The Philosophical Review, 85(3):297–315, July 1976. [Li and Vit´ anyi, 1997] Ming Li and P. M. B. Vit´ anyi. An Introduction to Kolmogorov Complexity and its applications. Springer Verlag, New York, 1997. [Long and Servedio, 2006] Phil Long and Rocco Servedio. Discriminative learning can succeed where generative learning fails. In The 19th Annual Conference on Learning Theory, Carnegie Mellon University, Pittsburgh, Pennsylvania, U.S.A., 2006. [Maheswaran et al., 2006] T. Maheswaran, J. G. Sanjayan, D. L. Dowe, and P. J. Tan. MML mixture models of heterogeneous Poisson processes with uniform outliers for bridge deterioration. In Lecture Notes in Artificial Intelligence (LNAI) (Springer), Proc. 19th Australian Joint Conf. on Artificial Intelligence, pages 322 – 331, Hobart, Australia, Dec. 2006. [Makalic et al., 2003] E. Makalic, L. Allison, and D. L. Dowe. MML inference of single-layer neural networks. In Proc. of the 3rd IASTED Int. Conf. on Artificial Intelligence and Applications, pages 636–642, September 2003. See also Technical Report TR 2003/142, CSSE, Monash University, Australia Oct. 2003. [Martin-L¨ of, 1966] P. Martin-L¨ of. The definition of random sequences. Information and Control, 9:602–619, 1966. [McKenzie et al., 1993] D. P. McKenzie, P. D. McGorry, C. S. Wallace, L. H. Low, D. L. Copolov, and B. S. Singh. Constructing a minimal diagnostic decision tree. 
Methods in Information in Medicine, 32:161–166, 1993. [Milne, 1996] P. Milne. log[P r(H|E ∩ B)/P r(H|B)] is the one true measure of confirmation. Philosophy of Science, 63:21–26, 1996. [Molloy et al., 2006] S. B. Molloy, D. W. Albrecht, D. L. Dowe, and K. M. Ting. Model-Based Clustering of Sequential Data. In Proceedings of the 5th Annual Hawaii International Conference on Statistics, Mathematics and Related Fields, January 2006. [Murphy and Pazzani, 1994] P. Murphy and M. Pazzani. Exploring the decision forest: An empirical investigation of Occam’s razor in decision tree induction. Journal of Artificial Intelligence, 1:257–275, 1994. [Musgrave and Dowe, 2010] S. Musgrave and D. L. Dowe. Kinship, optimality and typology, Behavioral and Brain Sciences (BBS), 33(5), 2010. [Needham and Dowe, 2001] S. L. Needham and D. L. Dowe. Message length as an effective Ockham’s razor in decision tree induction. In Proc. 8th Int. Workshop on Artif. Intelligence and Statistics (AI+STATS 2001), pages 253–260, Jan. 2001.
MML, Hybrid Bayesian Network Graphical Models, ...
977
[Neil et al., 1999a] J. R. Neil, C. S. Wallace, and K. B. Korb. Learning Bayesian networks with restricted causal interactions. In Kathryn B. Laskey and Henri Prade, editors, Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 486–493, S.F., Cal., July 30–August 1 1999. Morgan Kaufmann Publishers. [Neil et al., 1999b] Julian R. Neil, C. S. Wallace, and K. B. Korb. Bayesian networks with non-interacting causes. Technical Report 1999/28, School of Computer Science & Software Engineering, Monash University, Australia 3168, September 1999. [Neyman and Scott, 1948] J. Neyman and E. L. Scott. Consistent estimates based on partially consistent observations. Econometrika, 16:1–32, 1948. [Oliver, 1993] J. J. Oliver. Decision graphs — an extension of decision trees. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics, pages 343–350, 1993. Extended version available as TR 173, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia. [Oliver and Baxter, 1994] J. J. Oliver and R. A. Baxter. MML and Bayesianism: similarities and differences. Technical report TR 94/206, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, 1994. [Oliver et al., 1998] J. J. Oliver, R. A. Baxter, and C. S. Wallace. Minimum message length segmentation. In Xindong Wu, Ramamohanarao Kotagiri, and Kevin B. Korb, editors, Proceedings of the 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), volume 1394 of LNAI, pages 222–233, Berlin, April 15–17 1998. Springer. [Oliver et al., 1996] J. J. Oliver, Rohan A. Baxter, and Chris S. Wallace. Unsupervised learning using MML. In Proc. 13th International Conference on Machine Learning, pages 364–372. Morgan Kaufmann, 1996. [Oliver and Dowe, 1995] J. J. Oliver and D. L. Dowe. Using unsupervised learning to assist supervised learning. In Proc. 8th Australian Joint Conf. on Artificial Intelligence, pages 275– 282, November 1995. See also TR 95/235, Dept. Comp. Sci., Monash University, Australia 3168, Sep 1995. [Oliver et al., 1992] J. J. Oliver, D. L. Dowe, and C. S. Wallace. Inferring decision graphs using the minimum message length principle. In Proc. of the 1992 Aust. Joint Conf. on Artificial Intelligence, pages 361–367, September 1992. [Oliver and Hand, 1996] J. J. Oliver and D. J. Hand. Averaging on decision trees. Journal of Classification, 1996. An extended version is available as Technical Report TR 5-94, Dept. of Statistics, Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. [Oliver and Hand, 1994] J. J. Oliver and D.J. Hand. Fanned decision trees. Technical report TR 5-94, Dept. of Statistics, Open University, Walton Hall, Milton Keynes, MK7 6AA, UK, 1994. [Oliver and Wallace, 1991] J. J. Oliver and C. S. Wallace. Inferring decision graphs. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91), workshop 8, January 1991. [Oliver and Wallace, 1992] J. J. Oliver and C. S. Wallace. Inferring decision graphs. Technical report CS 92/170, Dept Computer Science, Monash University, Melbourne, Australia, 1992. [Ooi and Dowe, 2005] J. N. Ooi and D. L. Dowe. Inferring phylogenetic graphs of natural languages using minimum message length. In CAEPIA 2005 (11th Conference of the Spanish Association for Artificial Intelligence), volume 1, pages 143–152, Nov. 2005. [Papp et al., 1993] E. Papp, D. L. Dowe, and S. J. D. Cox. 
Spectral classification of radiometric data using an information theory approach. In Proc. Advanced Remote Sensing Conf., pages 223–232, UNSW, Sydney, Australia, July 1993. [Parry, 2005] Leigh Parry. Midas touch. The Age newspaper, Melbourne, Australia (Education section), page 6 (in Education section), Mon. 20 June 2005. www.TheAge.com.au , www.monash.edu.au/policy/midas.htm. [Patrick, 1978] J. D. Patrick. An Information Measure Comparative Analysis of Megalithic Geometries. PhD thesis, Department of Computer Science, Monash University, Australia, 1978. [Patrick and Wallace, 1977] J. D. Patrick and C. S. Wallace. Stone circles: A comparative analysis of megalithic geometry. In Proc. 48th Australian & New Zealand Association for the Advancement of Science (ANZAAS) Conference. 1977. abstract.
978
David L. Dowe
[Patrick and Wallace, 1982] J. D. Patrick and C. S. Wallace. Stone circle geometries: an information theory approach. In D. Heggie, editor, Archaeoastronomy in the New World, pages 231–264. Cambridge University Press, 1982. [Phillips and Ploberger, 1996] P. C. B. Phillips and W. Ploberger. An asymptotic theory of Bayesian inference for time series. Econometrica, 64(2):240–252, 1996. [Pilowsky et al., 1969] I. Pilowsky, S. Levine, and D.M. Boulton. The classification of depression by numerical taxonomy. British Journal of Psychiatry, 115:937–945, 1969. [Prior et al., 1998] M. Prior, R. Eisenmajer, S. Leekam, L. Wing, J. Gould, B. Ong, and D. L. Dowe. Are there subgroups within the autistic spectrum? A cluster analysis of a group of children with autistic spectrum disorders. J. Child Psychol. Psychiat., 39(6):893–902, 1998. [Quinlan and Rivest, 1989] J.R. Quinlan and R.L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227–248, 1989. [Rissanen, 1976] J. J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Develop., 20(3):198–203, May 1976. [Rissanen, 1978] J. J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978. [Rissanen, 1996] J. J. Rissanen. Fisher Information and Stochastic Complexity. IEEE Trans. on Information Theory, 42(1):40–47, January 1996. [Rissanen, 1999a] J. J. Rissanen. Hypothesis selection and testing by the MDL principle. Computer Journal, 42(4):260–269, 1999. [Rubinstein et al., 2007] B. Rubinstein, P. Bartlett, and J. H. Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS 2006). MIT Press, Cambridge, MA, U.S.A., 2007. [Rumantir and Wallace, 2001] G. W. Rumantir and C. S. Wallace. Sampling of highly correlated data for polynomial regression and model discovery. In The 4th International Symposium on Intelligent Data Analysis (IDA), pages 370–377, 2001. [Rumantir and Wallace, 2003] G. W. Rumantir and C. S. Wallace. Minimum message length criterion for second-order polynomial model selection applied to tropical cyclone intensity forecasting. In The 5th International Symposium on Intelligent Data Analysis (IDA), pages 486–496, 2003. ˜ [Sanghi and Dowe, 2003] P.Sanghi and D. L. Dowe. A computer program capable of passing I.Q. tests. In 4th International Conference on Cognitive Science (and 7th Australasian Society for Cognitive Science Conference), volume 2, pages 570–575, Univ. of NSW, Sydney, Australia, Jul 2003. [Schmidhuber, 2007] J. Schmidhuber. Simple algorithmic principles of discovery, subjective beauty, selective attention, curiosity & creativity. In Lecture Notes in Computer Science (LNCS) 4755, pages 26–38. Springer, 2007. [Schmidt, 2008] D. F. Schmidt. Minimum Message Length Inference of Autoregressive Moving Average Models. PhD thesis, Faculty of Information Technology, Monash University, 2008. [Schmidt and Makalic, 2009b] D. F. Schmidt and E. Makalic. Minimum message length shrinkage estimation. Statistics & Probability Letters, 79(9):1155–1161, 2009. [Schmidt and Makalic, 2009c] D. F. Schmidt and E. Makalic. MML invariant linear regression. In Lecture Notes in Artificial Intelligence (Proc. 22nd Australian Joint Conf. on Artificial Intelligence [AI’09]). Springer, December 2009. [Schwarz, 1978] G. Schwarz. Estimating dimension of a model. Ann. Stat., 6:461–464, 1978. [Searle, 1980] J. R. Searle. 
Minds, brains and programs. Behavioural and Brain Sciences, 3:417–457, 1980. [Shmueli and Koppius, 2007] G. Shmueli and O. Koppius. Predictive vs. Explanatory Modeling in IS Research. In Proc. Conference on Information Systems & Technology, 2007. Seattle, Wa, U.S.A. (URL www.citi.uconn.edu/cist07/5c.pdf). [Solomonoff, 1960] R. J. Solomonoff. A preliminary report on a general theory of inductive inference. Report V-131, Zator Co., Cambridge, Mass., U.S.A., 4 Feb. 1960. [Solomonoff, 1964] R. J. Solomonoff. A formal theory of inductive inference. Information and Control, 7:1–22,224–254, 1964. [Solomonoff, 1995] R. J. Solomonoff. The discovery of algorithmic probability: A guide for ˜ the programming of true creativity. In P.Vitanyi, editor, Computational Learning Theory: EuroCOLT’95, pages 1–22. Springer-Verlag, 1995.
MML, Hybrid Bayesian Network Graphical Models, ...
979
[Solomonoff, 1996] R. J. Solomonoff. Does algorithmic probability solve the problem of induction? In D. L. Dowe, K. B. Korb, and J. J Oliver, editors, Proceedings of the Information, Statistics and Induction in Science (ISIS) Conference, pages 7–8, Melbourne, Australia, August 1996. World Scientific. ISBN 981-02-2824-4. [Solomonoff, 1997a] R. J. Solomonoff. The discovery of algorithmic probability. Journal of Computer and System Sciences, 55(1):73–88, 1997. [Solomonoff, 1997b] R. J. Solomonoff. Does algorithmic probability solve the problem of induction? Report, Oxbridge Research, P.O.B. 400404, Cambridge, Mass. 02140, U.S.A., 1997. See http://world.std.com/∼rjs/isis96.pdf. [Solomonoff, 1999] R. J. Solomonoff. Two kinds of probabilistic induction. Computer Journal, 42(4):256–259, 1999. Special issue on Kolmogorov Complexity. [Solomonoff, 2008] R. J. Solomonoff. Three kinds of probabilistic induction: Universal and convergence theorems. Computer Journal, 51(5):566–570, September 2008. [Tan and Dowe, 2002] P. J. Tan and D. L. Dowe. MML inference of decision graphs with multiway joins. In R. McKay and J. Slaney, editors, Proc. 15th Australian Joint Conference on Artificial Intelligence — Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Germany, ISSN: 0302-9743, Vol. 2557, number 2557 in Lecture Notes in Artificial Intelligence (LNAI), pages 131–142. Springer Verlag, 2002. [Tan and Dowe, 2003] P. J. Tan and D. L. Dowe. MML inference of decision graphs with multiway joins and dynamic attributes. In Lecture Notes in Artificial Intelligence (LNAI) 2903 (Springer), Proc. 16th Australian Joint Conf. on Artificial Intelligence, pages 269–281, Perth, Australia, Dec. 2003. [Tan and Dowe, 2004] P. J. Tan and D. L. Dowe. MML inference of oblique decision trees. In Lecture Notes in Artificial Intelligence (LNAI) 3339 (Springer), Proc. 17th Australian Joint Conf. on Artificial Intelligence, volume 3339, pages 1082–1088, Cairns, Australia, Dec. 2004. [Tan and Dowe, 2006] P. J. Tan and D. L. Dowe. Decision forests with oblique decision trees. In Lecture Notes in Artificial Intelligence (LNAI) 4293 (Springer), Proc. 5th Mexican International Conf. Artificial Intelligence, pages 593–603, Apizaco, Mexico, Nov. 2006. [Tan et al., 2007] P. J. Tan, D. L. Dowe, and T. I. Dix. Building classification models from microarray data with tree-based classification algorithms. In Lecture Notes in Artificial Intelligence (LNAI) 4293 (Springer), Proc. 20th Australian Joint Conf. on Artificial Intelligence, Dec. 2007. [Torsello and Dowe, 2008a] A. Torsello and D. L. Dowe. Learning a generative model for structural representations. In Lecture Notes in Artificial Intelligence (LNAI), volume 5360, pages 573–583, 2008. [Torsello and Dowe, 2008b] A. Torsello and D. L. Dowe. Supervised learning of a generative model for edge-weighted graphs. In Proc. 19th International Conference on Pattern Recognition (ICPR2008). IEEE, 2008. 4pp. IEEE Catalog Number: CFP08182 , ISBN: 978-1-42442175-6 , ISSN: 1051-4651. [Turing, 1936] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. London Math. Soc. 2, 42:230–265, 1936. [Uther and Veloso, 2000] W. T. B. Uther and M. M. Veloso. The Lumberjack Algorithm for Learning Linked Decision Forests. In Proc. 6th Pacific Rim International Conf. on Artificial Intelligence (PRICAI’2000), Lecture Notes in Artificial Intelligence (LNAI) 1886 (Springer), pages 156–166, 2000. [Vapnik, 1995] V. N. Vapnik. The Nature of Statistical Learning Theory. 
Springer, 1995. [Visser and Dowe, 2007] Gerhard Visser and D. L. Dowe. Minimum message length clustering of spatially-correlated data with varying inter-class penalties. In Proc. 6th IEEE International Conf. on Computer and Information Science (ICIS) 2007, pages 17–22, July 2007. [Visser et al., 2009a] Gerhard Visser, D. L. Dowe, and I. D. Svalbe. Information-theoretic image reconstruction and segmentation from noisy projections. In Lecture Notes in Artificial Intelligence (Proc. 22nd Australian Joint Conf. on Artificial Intelligence [AI’09]), pp. 170–179. Springer, December 2009. [Visser et al., 2009b] Gerhard Visser, D. L. Dowe, and J. Petteri Uotila. Enhancing MML clustering using context data with climate applications. In Lecture Notes in Artificial Intelligence (Proc. 22nd Australian Joint Conf. on Artificial Intelligence [AI’09]), pp. 350–359. Springer, December 2009.
980
David L. Dowe
[Viswanathan and Wallace, 1999] M. Viswanathan and C. S. Wallace. A note on the comparison of polynomial selection methods. In D Heckerman and J Whittaker, editors, Proceedings of Uncertainty 99: The Seventh International Workshop on Artificial Intelligence and Statistics, pages 169–177, Fort Lauderdale, Florida, USA, January 1999. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA. [Viswanathan et al., 1999] Murlikrishna Viswanathan, C. S. Wallace, David L. Dowe, and Kevin B. Korb. Finding cutpoints in noisy binary sequences — a revised empirical evaluation. In Proc. 12th Australian Joint Conference on Artificial Intelligence, volume 1747 of Lecture Notes in Artificial Intelligence, pages 405–416. Springer Verlag, 1999. [Wallace, 1973a] C. S. Wallace. Simulation of a two-dimensional gas. In Proc. of the Australian & New Zealand Association for the Advancement of Science (ANZAAS) Conf., page 19, August 1973. abstract. [Wallace, 1984b] C. S. Wallace. An improved program for classification. Technical Report 47, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, Melbourne, 1984. [Wallace, 1984a] C. S. Wallace. Inference and estimation by compact coding. Technical Report 84/46, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, August 1984. [Wallace, 1986] C. S. Wallace. An improved program for classification. In Proc. of the 9th Australian Computer Science Conference (ACSC-9), pages 357–366, February 1986. Published as Proc. of ACSC-9, volume 8, number 1. [Wallace, 1990c] C. S. Wallace. Classification by minimum-message-length encoding. In S. G. Akl et al, editor, Advances in Computing and Information — ICCI ’90, volume 468 of Lecture Notes in Computer Science (LNCS), pages 72–81. Springer-Verlag, May 1990. [Wallace, 1990b] C S Wallace. Classification by minimum-message-length inference. In Working Notes AAAI Spring Symposium Series, pages 65–69. Stanford Uni., Calif., U.S.A., 1990. [Wallace, 1992] C. S. Wallace. A Model of Inductive Inference. Seminar, November 1992. Also on video, Dept. of Computer Science, Monash University, Clayton 3168, Australia, Wed. 25 Nov. 1992. [Wallace, 1995a] C. S. Wallace. Multiple Factor Analysis by MML Estimation. Technical report CS TR 95/218, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, Clayton, Melbourne, Australia, 1995. [Wallace, 1996c] C. S. Wallace. False oracles and SMML estimators. In D. L. Dowe, K. B. Korb, and J. J Oliver, editors, Proceedings of the Information, Statistics and Induction in Science (ISIS) Conference, pages 304–316, Melbourne, Australia, August 1996. World Scientific. ISBN 981-02-2824-4. Was previously Tech Rept 89/128, Dept. Comp. Sci., Monash Univ., Australia, June 1989. [Wallace, 1996b] C. S. Wallace. MML inference of predictive trees, graphs and nets. In A. Gammerman, editor, Computational Learning and Probabilistic Reasoning, chapter 3, pages 43–66. Wiley, 1996. [Wallace, 1997] C. S. Wallace. On the selection of the order of a polynomial model. Technical report, Royal Holloway College, England, U.K., 1997. Chris released this in 1997 (from Royal Holloway) in the belief that it would become a Royal Holloway Tech Rept dated 1997, but it is not clear that it was ever released there. Soft copy certainly does exist, though. Perhaps see www.csse.monash.edu.au/∼dld/CSWallacePublications. [Wallace, 1998d] C. S. Wallace. Competition isn’t the only way to go, a Monash FIT graduation address, April 1998. 
(Perhaps see www.csse.monash.edu.au/∼dld/CSWallacePublications). [Wallace, 1998a] C. S. Wallace. Intrinsic classification of spatially correlated data. Computer Journal, 41(8):602–611, 1998. [Wallace, 1998b] C. S. Wallace. Multiple factor analysis by MML estimation. In Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), page 144, Queensland, Australia, July 1998. [Wallace, 1998c] C. S. Wallace. On the selection of the order of a polynomial model. In W. Robb, editor, Proc. of the 14th Biennial Australian Statistical Conf., page 145, Queensland, Australia, July 1998. [Wallace, 1998e] C. S. Wallace. PAKDD-98 Tutorial: Data Mining, 15-17 April 1998. Tutorial entitled “Data Mining” at the 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), held in Melbourne, Australia. This
MML, Hybrid Bayesian Network Graphical Models, ...
981
partly constituted an early draft of Chris Wallace’s 2005 book “Statistical and Inductive Inference by Minimum Message Length”. [Wallace, 1999] C. S. Wallace. The MIT Encyclopedia of the Cognitive Sciences (MITECS), chapter Minimum description length, (major review), pages 550–551. The MIT Press, London, England, ISBN: 0-262-73124-X, 1999. [Wallace, 2005] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Information Science and Statistics. Springer Verlag, May 2005. ISBN 0-387-23795X. [Wallace and Boulton, 1968] C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2):185–194, 1968. [Wallace and Boulton, 1975] C. S. Wallace and D. M. Boulton. An invariant Bayes method for point estimation. Classification Society Bulletin, 3(3):11–34, 1975. [Wallace and Dale, 2005] C. S. Wallace and M. B. Dale. Hierarchical clusters of vegetation types. Community Ecology, 6(1):57–74, 2005. ISSN: 1585-8553. [Wallace and Dowe, 1993] C. S. Wallace and D. L. Dowe. MML estimation of the von Mises concentration parameter. Technical Report 93/193, Dept. of Computer Science, Monash University, Clayton 3168, Australia, December 1993. [Wallace and Dowe, 1994a] C. S. Wallace and D. L. Dowe. Estimation of the von Mises concentration parameter using minimum message length. In Proc. 12th Aust. Stat. Soc. Conf., 1994. 1 page abstract. [Wallace and Dowe, 1994b] C. S. Wallace and D. L. Dowe. Intrinsic classification by MML — the Snob program. In Proc. 7th Australian Joint Conf. on Artificial Intelligence, pages 37–44. World Scientific, November 1994. [Wallace and Dowe, 1996] C. S. Wallace and D. L. Dowe. MML mixture modelling of Multistate, Poisson, von Mises circular and Gaussian distributions. In Proc. Sydney International Statistical Congress (SISC-96), page 197, Sydney, Australia, 1996. [Wallace and Dowe, 1997a] C. S. Wallace and D. L. Dowe. MML mixture modelling of multistate, Poisson, von Mises circular and Gaussian distributions. Proc 28th Symp. on the Interface, pages 608–613, 1997. [Wallace and Dowe, 1997b] C. S. Wallace and D. L. Dowe. MML mixture modelling of multistate, Poisson, von Mises circular and Gaussian distributions. In Sixth International Workshop on Artificial Intelligence and Statistics, Society for AI and Statistics, pages 529–536, San Francisco, USA, 1997. [Wallace and Dowe, 1999a] C. S. Wallace and D. L. Dowe. Minimum message length and Kolmogorov complexity. Computer Journal, 42(4):270–283, 1999. [Wallace and Dowe, 1999b] C. S. Wallace and D. L. Dowe. Refinements of MDL and MML coding. Computer Journal, 42(4):330–337, 1999. [Wallace and Dowe, 1999c] C. S. Wallace and D. L. Dowe. Rejoinder. Computer Journal, 42(4):345–347, 1999. [Wallace and Dowe, 2000] C. S. Wallace and D. L. Dowe. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing, 10:73–83, January 2000. [Wallace and Freeman, 1987] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society series B, 49(3):240–252, 1987. See also Discussion on pp252-265. [Wallace and Freeman, 1992] C. S. Wallace and P. R. Freeman. Single-factor analysis by minimum message length estimation. J. Royal Stat. Soc. B, 54(1):195–209, 1992. [Wallace and Georgeff, 1983] C. S. Wallace and M. P. Georgeff. A general objective for inductive inference. Technical Report #83/32, Department of Computer Science, Monash University, Clayton, Australia, March 1983. 
Reissued in June 1984 as TR No. 44. [Wallace and Korb, 1994] C. S. Wallace and K. B. Korb. A Bayesian learning agent. In C. S. Wallace, editor, Research conference: Faculty of Computing and Information Technology, page 19. Monash University Melbourne, 1994. [Wallace and Korb, 1997] C. S. Wallace and K B Korb. Learning linear causal models by MML sampling. Technical report CS 97/310, Dept Computer Science, Monash University, Melbourne, Australia, 1997. [Wallace and Korb, 1999] C. S. Wallace and K. B. Korb. Learning linear causal models by MML sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89–111. Springer Verlag, 1999. see TR 97/310, Dept. Comp. Sci., Monash Univ., Australia, June 1997.
982
David L. Dowe
[Wallace et al., 1996b] C. S. Wallace, K B Korb, and H Dai. Causal discovery via MML. Technical report CS 96/254, Dept Computer Science, Monash University, Melbourne, Australia, 1996. [Wallace et al., 1996a] C. S. Wallace, Kevin B. Korb, and Honghua Dai. Causal discovery via MML. In 13th International Conf. on Machine Learning (ICML-96), pages 516–524, 1996. [Wallace and Patrick, 1991] C. S. Wallace and J D Patrick. Coding decision trees. Technical report CS 91/153, Dept Computer Science, Monash University, Melbourne, Australia, 1991. [Wallace and Patrick, 1993] C. S. Wallace and J. D. Patrick. Coding decision trees. Machine Learning, 11:7–22, 1993. [Zakis et al., 1994] J. D. Zakis, I. Cosic, and D. L. Dowe. Classification of protein spectra derived for the resonant recognition model using the minimum message length principle. Australian Comp. Sci. Conf. (ACSC-17), pages 209–216, January 1994.
SIMPLICITY, TRUTH, AND PROBABILITY

Kevin T. Kelly
1
INTRODUCTION
Scientific theories command belief or, at least, confidence in their ability to predict what will happen in remote or novel circumstances. The justification of that trust must derive, somehow, from scientific method. And it is clear, both from the history of science and from the increasing codification and automation of the scientific method both in statistics and in machine learning, that a major component of that method is Ockham's razor, a systematic bias toward simple theories, where "simplicity" has something to do with minimizing free parameters, gratuitous entities and causes, independent principles and ad hoc explanations and with maximizing unity, testability, and explanatory power. Ockham's razor is not a bloodless, formal rule that must be learned — it has a native, visceral grip on our credence. For a celebrated example, Copernicus was driven to move the earth to eliminate five epicycles from medieval astronomy [Kuhn, 1957]. The principal problem of positional planetary astronomy was to account for the apparently irregular, retrograde or backward motion of the planets against the fixed stars. According to the standard, Ptolemaic theory of the time, retrograde motion results from the planet revolving around an epicycle or circle whose center revolves, in turn, on another circle called the deferent, centered on the earth. Making the epicycle revolve in the same sense as the deferent implies that the planet should be closest or brightest at the midpoint of its retrograde motion, which agrees with observations. Copernicus explained retrograde motion in terms of the moving earth being lapped or lapping the other planets on a cosmic racetrack centered on the sun, which eliminates one epicycle per planet (figure 1).

Figure 1. Ptolemy vs. Copernicus

Copernicus still required many superimposed circles to approximate elliptical orbits, so the mere elimination of five such circles may not seem very impressive. But there is more to the story than just counting circles. It happens that the retrograde motions of Mars, Jupiter, and Saturn occur precisely when the respective planet is in solar opposition (i.e., is observed $180^\circ$ from the sun) and that the retrograde motions of Mercury and Venus occur at solar conjunction (i.e., when the respective planet is $0^\circ$ from the sun). Ptolemy's epicycles can be adjusted to recover the same effect, but only in a rather bizarre manner. Think of the line from the earth to the sun as the hand of a clock and think of the line from the center of Saturn's epicycle to Saturn as the hand of another clock. Then retrograde motion happens exactly at solar opposition if and only if Saturn's epicycle clock is
perfectly synchronized with the sun's deferent clock. The same is true of Mars and Jupiter. Furthermore, Mercury and Venus undergo retrograde motion exactly at solar conjunction just in case their deferent clocks are perfectly synchronized with the sun's deferent clock. In Ptolemy's theory, these perfect synchronies across vast distances in the solar system appear bizarre and miraculous. On Copernicus' theory, however, they are ineluctable, geometrical banalities: the earth passes an outer planet exactly when the earth crosses the line from the central sun to the planet passed. So Copernicus' theory crisply explains the striking synchronies. Copernicus' theory is also severely tested by the synchronies, since it would be refuted by any perceived deviation from exact synchrony, however slight. Ptolemy's theory, on the other hand, merely accommodates the data in an ad hoc manner by means of its adjustable parameters. It seems that Copernicus' theory should get some sort of reward for surviving a test shirked by its competitor. One could add clockwork gears to Ptolemy's theory to explain the synchronies, but that would be an extra principle receiving no independent confirmation from other evidence. Copernicus' explanation, on the other hand, recovers both retrograde motion and its correlation with solar position from the geometry of a circular racetrack, so it provides a unified explanation of the two phenomena. Empirical simplicity is more than mere notational brevity — it implies such red-blooded considerations as explanatory power [Harman, 1965], unity [Kitcher, 1982], independently confirmable principles [Friedman, 1983; Glymour, 1980] and severe testability [Popper, 1968; Mayo, 1996].

Another standard example of Ockham's razor in action concerns the search for empirical laws (figure 2).

Figure 2. Inferring polynomial degree

Any finite number of observations can be connected with a polynomial curve that passes through each, but we may still prefer a straight line that comes close to each point. It is, perhaps, more tempting in this case to identify simplicity with syntactic length or complexity of the law, since $\alpha_0 x^0 + \alpha_1 x^1 + \ldots + \alpha_n x^n$ is obviously more verbose than $\alpha_0 x^0 + \alpha_1 x^1$. But one can also say that the complex law merely accommodates the data by having an independent, adjustable parameter for each data point, whereas the two parameters of the simple law can be estimated with a few data points, providing an explanation of the remaining ones. The complex law is also less unified than the simple law
(the several coefficients receive isolated support from the data points they are set to account for) and is less visually "uniform" than the simple law.

Ockham's razor does the heavy lifting in scientific theory choice, for no other principle suffices to winnow the infinite range of possible explanations of the available data down to a unique one. And whereas simplicity was once the theorist's personal prerogative, it is now a mathematically explicit and essential component of contemporary statistical and computational techniques for drawing conclusions from empirical data (cf. [Mitchell, 1997; Duda et al., 2001]).

The explicitness and indispensability of Ockham's razor in scientific theory selection raises a natural question about its justification. Epistemic justification is not just a word or a psychological urge or a socially sanctioned, exculpatory ritual or procedure. It should imply some sort of truth-conduciveness of the underlying process by which one's trust is produced. An attractively ambitious concept of truth-conduciveness is reliable indication of the truth, which means that the process has a high chance of producing the true theory, whatever the truth happens to be, the way a properly functioning thermometer indicates temperature. But Ockham's razor is more like a trick thermometer whose reading never changes. Such a thermometer cannot be said to indicate the temperature even if its fixed reading happens to be true. Neither can a fixed bias toward simplicity immediately indicate the truth about nature — unless the truth is already known to be simple, in which case there would be no need to invoke Ockham's razor by way of justification.1

1 This updated version of Plato's Meno paradox is underscored in machine learning by the "no free lunch theorems" [Wolpert, 1996].

Ockham's razor has a good excuse for failing to reliably indicate true theories, since theory choice requires inductive inference and no inductive inference method can be a truth-indicator: each finite set of data points drawn with bounded precision from a linear law is also compatible with a sufficiently flat parabola, so no possible data-driven process could reliably indicate, in the short run, whether the truth is linear or quadratic. A more feasible concept of truth-conduciveness for inductive inference is convergence in the limit, which means that the chance that the method produces the true theory converges to one, no matter what the true
theory might be.2 Convergence to the truth in the limit is far weaker than short-run truth-indication, since it is compatible with the choice of any finite number of false theories with arbitrarily high chance before settling on the correct one. Each time a new theory $T_{n+1}$ is produced with high chance, the chance of producing the previous candidate $T_n$ must drop precipitously and one may say that the output is retracted. So convergence in the limit differs from reliable indication by allowing any finite number of arbitrarily precipitous retractions prior to "locking on" to the right answer. Assuming that the true theory is polynomial, Ockham's razor does converge in the limit to the true polynomial degree of $f(x)$ — each polynomial degree lower than the true degree is ruled out, eventually, by the data (e.g., when new bumps in the true law become noticeable), after which the true theory is the simplest theory compatible with experience. Think of successively more complex theories as tin cans lined up on a fence, one of which (the true one) is nailed to the fence. Then, if one shoots the cans from left to right, eventually the nailed can becomes and remains the first can in line that has not yet been shot down.

2 This concept is called convergence in probability in probability theory and consistency in statistics.

The familiar trouble with this explanation of Ockham's razor is that convergence in the long run is compatible with reliance on any alternative bias for any finite duration [Salmon, 1967]. For example, guess an equation of degree 10 with the hope that the coefficients are so large that the bumps will be noticed early — say, in a sample of size 1000. If they aren't seen by then, revert back to Ockham's razor, which succeeds in the limit. Hence, convergence in the limit is feasible in theoretical inference, but it does not single out simple theories as the right theories to produce in the short run.

To summarize, the justification of Ockham's razor poses a puzzle. Ockham's razor can't reliably indicate the true theory in the short run, due to the problem of induction. And although Ockham's razor does converge to the truth in the ideal limit of inquiry, alternative methods producing very complex theories are also truth-conducive in that very weak sense as well [Salmon, 1967]. So short-run indication is too strong to be feasible and long-run convergence is too weak to single out Ockham's razor. It remains, therefore, to define a sense of truth-conduciveness according to which it can be argued, without circularity, that Ockham's razor helps one find the truth better than alternative methods that would produce arbitrarily complex theories now. Absent such a story, Ockham's razor starts to look like an exercise in wishful thinking — the epistemic sin of inferring that reality is simple because the true theory of a simple world would have pragmatic virtues (e.g., explanatory power) that one would like to have. Such doubts motivate a skeptical or anti-realist attitude toward scientific theories in general [van Fraassen, 1981].

This paper reviews the standard explanations of Ockham's razor, which fall into two main groups. The first group invokes a tacit, prior bias toward simplicity, which begs the question in favor of Ockham's razor. The second group substitutes a particular notion of predictive accuracy for truth, based on the surprising fact that a false theory may make more accurate predictions than the true one when
the truth is complex. That evidently fails to explain how Ockham's razor finds true theories (as opposed to useful models). Furthermore, when predictions concern the outcomes of interventions on the world, even the argument for predictive accuracy fails.3 Since neither approach really explains how Ockham's razor leads to true theories or even to accurate policy predictions, the second part of the paper develops an entirely new explanation: Ockham's razor does not point at the truth, even with high probability, but it does help one arrive at the truth with uniquely optimal efficiency, where efficiency is measured in terms of such epistemically pertinent considerations as the total number of errors and retractions of prior opinions incurred before converging to the truth and the elapsed times by which the retractions occur. Thus, in a definite sense, Ockham's razor is demonstrably the uniquely most truth-conducive method for inferring general theories from particular facts — even though no possible method can be guaranteed to point toward the truth with high probability in the short run.

3 For candid discussions of the shortcomings of the usual explanations of Ockham's razor as it is used in machine learning, cf., for example, [Domingos, 1999] and [Mitchell, 1997].

2
THE ARGUMENT FROM BAYES FACTORS
Bayesian statisticians assign probability-valued degrees of belief to all the propositions in some language and then "rationally" update those degrees of belief by a universal rule called conditionalization.4 If $p_t(T)$ is your prior degree of belief that $T$ at stage $t$ and if $E$ is new evidence received at stage $t+1$, then conditionalization says that your updated degree of belief that $T$ at $t+1$ should be:

$$p_{t+1}(T) = p_t(T \mid E).$$

It follows from the conditionalization rule that:

$$p_{t+1}(T) = \frac{p_t(T) \cdot p_t(E \mid T)}{p_t(E)}.$$

An important feature of the rule is that one's updated degree of belief $p_{t+1}$ depends on one's prior degree of belief $p_t(T)$, which might have been strongly biased for or against $T$ prior to collecting any evidence about $T$ whatever. That feature suggests an easy "justification" of Ockham's razor — just start out with prior probabilities biased toward simple theories. Then, if simple theories explain the data about as well as complex ones, the prior bias toward the simple theory passes through the updating procedure. But to invoke a prior bias toward simplicity to explain a prior bias toward simplicity evidently begs the main question at hand.

4 Not all Bayesians accept updating by conditionalization. Some Bayesians recommend accepting hypotheses altogether, in which case the degree of belief goes to one. Others recommend updating on partially believed evidence. Others recommend updating interval-valued degrees of belief, etc. Others reject its coherentist justification in terms of diachronic Dutch books.

A more promising Bayesian argument for Ockham's razor centers not on the prior probability $p_t(T)$, but on the term $p_t(E \mid T)$, which corresponds to the rational credence conferred on $E$ by theory $T$ (cf. [Jeffreys, 1961; Rosenkrantz, 1983; Myrvold, 2003]).
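The update rule is easy to exhibit numerically. The following is a minimal sketch (not from the text): the theories, prior, and likelihood values are illustrative assumptions, and conditionalization is implemented as multiplication by the likelihood followed by renormalization.

```python
# Hypothetical two-theory example of conditionalization: p_{t+1}(T) = p_t(T | E).
def conditionalize(prior, likelihood):
    """prior: p_t over theories; likelihood: p_t(E | T) for each theory T."""
    p_E = sum(prior[T] * likelihood[T] for T in prior)   # p_t(E), by total probability
    return {T: prior[T] * likelihood[T] / p_E for T in prior}

prior = {"T1": 0.5, "T2": 0.5}            # no initial bias (an assumption)
likelihood = {"T1": 0.9, "T2": 0.3}       # illustrative degrees of explanation
print(conditionalize(prior, likelihood))  # {'T1': 0.75, 'T2': 0.25}
```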
candid discussions of the shortcomings of the usual explanations of Ockham’s razor as it is used in machine learning, cf., for example, [Domingos, 1999] and [Mitchell, 1997]. 4 Not all Bayesians accept updating by conditionalization. Some Bayesians recommend accepting hypotheses altogether, in which case the degree of belief goes to one. Others recommend updating on partially believed evidence. Others recommend updating interval-valued degrees of belief, etc. Others reject its coherentist justification in terms of diachronic Dutch books.
988
Kevin T. Kelly
1983; Myrvold, 2003]). According to this explanation, Ockham’s razor does not demand that the simpler theory T1 start out ahead of its complex competitor T2 ; it suffices that T1 pull ahead of T2 when evidence E compatible with T1 is received. That sounds impressive, for the conditional probability pt (E | T ) is often thought to be more objective than the prior probability pt (T ), because pt (E | T ) reflects the degree to which T “explains” E. But that crucially overstates the case when T has free parameters to adjust, as when Ockham’s razor is at issue. Thoroughly subjective Bayesians interpret “objective” probabilities as nothing more than relatively inter-subjective degrees of belief, but a more popular, alternative view ties objectivity to chances. Chances are supposed to be natural, objective probabilities that apply to possible outcomes of random experiments. Chance will be denoted by a capital P , in contrast with the lower-case p denoting degrees of belief. Bayesian statisticians link chances to evidence and to action by means of the direct inference principle [Kyburg, 1977; Levi, 1977; Lewis, 1987], which states that degrees of belief should accord with known chances, given only admissible5 information E ′ : pt (E | P (E) = r ∧ E ′ ) = r. If theory T says exactly that the true chance distribution of X is P , then by the direct inference principle: pt (E | T ) = P (E), which is, indeed, objective. But if T is complex, then T has adjustable parameters and, hence, implies only that the true chance distribution lies in some set, say: {P1 , . . . , Pn }. Then the principle of direct inference yields the weighted average: pt (E | T ) =
n X i=1
Pi (E) · pt (Pi | T ),
in which the weights pt (Pn | T ) are prior degrees of belief, not chances. So the objective-looking quantity pt (E | T ) is loaded with prior opinion when T is complex and that potentially hidden fact is crucial to the Bayesian explanation of Ockham’s razor. A standard technique for comparing the posterior probabilities of theories is to look at the posterior ratio: pt (T1 | E) pt (T1 ) pt (E | T1 ) = · . pt (T2 | E) pt (T2 ) pt (E | T1 ) The first quotient on the right-hand-side is the prior ratio, which remains constant as new evidence E is received. The second quotient is the Bayes factor, which accounts for the entire impact of E on the relative credence of the two theories [Kass and Raftery, 1995]. To guarantee that p(T1 | E) > p(T2 | E), one must impose some further, material restrictions on coherent degrees of belief, but it can be argued that the constraints 5 Defining
admissibility is a vexed question that will be ignored here.
Simplicity, Truth, and Probability
989
are presupposed by the very question whether Ockham’s razor should be used when choosing between a simple and a complex theory. That places the Bayesian explanation of Ockham’s razor in in the same class of a priori metaphysical arguments that includes Descartes’ cogito, according to which the thesis “I exist” is evidently true each time one questions it. First of all, a Bayesian wouldn’t think of herself as choosing between T1 and T2 if she started with a strong bias toward one theory or the other, so let pt (T1 ) ≈ pt (T2 ). Second, she wouldn’t be choosing between two theories compatible with E unless simple theory T1 explains E, so that P (E) ≈ 1. Third, she wouldn’t say that T2 is complex unless T2 has a free parameter i to adjust to save the data. She would not say that the parameter of T2 is free unless she were fairly uncertain about which chance distribution Pi would obtain if T2 were true: e.g. pt (Pi | T2 ) ≈ 1/n. Furthermore, she would not say that the parameter must be adjusted to save E unless the chance of E is high only over a narrow range of possible chance distributions compatible with T2 : e.g., P0 (E) ≈ 1 and for each alternative i such that 0 < i ≤ n, pi (E) ≈ 0. It follows from the above assumptions that the prior probability ratio is approximately 1 and the Bayes’ factor is approximately n, so: pt (T1 | E) ≈ n. pt (T2 | E) Thus, the simple theory T1 ends up far more probable than the complex theory T2 in light of evidence E, as the complex theory T2 becomes more “adjustable”, which is the argument’s intended conclusion. When the set of possible chance distributions {Pθ : θ ∈ R} is continuously parameterized, the argument is similar, except that the (discrete) weighted sum expressing pt (E | T2 ) becomes a (continuous) integral: Z pt (E | T2 ) = Pθ (E) · pt (Pθ | T2 ) dθ,
which, again, is weighted by the subjective degrees of belief pt (Pθ | T2 ). Each of the above assumptions can be weakened. It suffices that the prior ratio not favor T2 too much, that the explanation of E by T1 not be too vague, that the explanation of E by T2 not be too robust across parameter values and that the distribution of degrees of belief over free parameters of T2 not be focused too heavily on the parameter values that more closely mimic the predictions of T1 . The Bayes factor argument for Ockham’s razor is closely related to standard paradoxes of indifference. Suppose that someone is entirely ignorant about the color of a marble in a box. Indifference over the various colors implies a strong bias against blue in the partition blue vs. non-blue, whereas indifference over blue vs. non-blue implies a strong bias against yellow. The Bayes factor argument amounts to plumping for the former bias. Think of the simple theory T0 as “blue” and of the complex theory T2 as “non-blue” with a “free parameter” ranging over red, green, yellow, etc. and assume, for example, that the evidence E is “either blue or red”. Then, by the above calculation, the posterior ratio of “blue” over “non-blue” is the number n of distinguished non-blue colors. Now consider the
990
Kevin T. Kelly
underlying prior probability over the refined partition blue, red, green, yellow, etc. It is apparent that “blue” is assigned prior probability 1/2, whereas each alternative color is assigned 1/2n, where n > 1. Hence, the complex theory starts out even with the simple theory, but each complex possibility starts out with a large disadvantage. Thus, although “red” objectively “explains” E just as well as “blue” does, the prior bias for “blue” over “red” gets passed through the Bayesian updating formula and begs the question in favor of“blue”. One could just as well choose to be “ignorant” over blue, red, green, yellow, etc., in which case “blue” and “red” end up locked in a tie after E is observed and “non-blue” remains more probable than “blue”. So the Bayes factor argument again comes down to a question-begging prior bias in favor of simple possibilities. One can attempt to single out the simplicity bias by expanding the Bayesian notion of rationality to include “objective” constraints on prior probability: e.g., by basing them on the length of Turing machine programs that would produce the data or type out the hypothesis [Jeffreys, 1961; Rissannen, 2007; Li and Vitanyi, 1993]. But that strategy is an epistemological red herring. Even if “rationality” is augmented to include an intuitively appealing, formal rule for picking out some prior biases over others, the real question regarding Ockham’s razor is whether such a bias helps one find the truth better than alternative biases (cf. [Mitchell, 1997]). To answer that question relevantly, one must explain, without circular appeal to the very bias in question, whether and in what sense Bayesians who start with a prior bias toward simplicity find the truth better than Bayesians starting with alternative biases would. There are two standard strategies for justifying Bayesian updating. Dutch Book arguments show that violating the Bayesian updating rule would result in preference for combinations of diachronic bets that result in a sure loss over time [Teller, 1976]. But such arguments do not begin to establish that Bayesian updating leads to higher degrees of belief in true theories in the short run. In fact, Bayesian updating can result in a huge short-run boost of credence in a false theory: e.g., when the the parameters of the true, complex theory are set very close to values that mimic observations fitting a simple alternative. Perhaps the nearest that Bayesians come to taking theoretical truth-conduciveness seriously is to argue that iterated Bayesian updating converges to the true theory in the limit, in the sense that p(T | En ) converges to the truth value of T as n increases.6 But the main shortcoming with that approach has already been discussed: both Ockham and non-Ockham initial biases are compatible with convergent success in the long run. In sum, Bayesians either beg the question in favor of simplicity by assigning higher prior probability to simpler possibilities, or they ignore truthconduciveness altogether in favor of arguments for coherence, or they fall back upon the insufficient strategy of appealing to long-run convergence.
6 Even then, convergence is guaranteed only with unit probability in the agent’s prior probability. The non-trivial consequences of that slip are reviewed in [Kelly, 1996].
Simplicity, Truth, and Probability
3
991
THE ARGUMENT FROM OVER-FITTING
Classical statisticians seek to justify scientific method entirely in terms of objective chances, so the Bayesian explanation of Ockham’s razor in terms of Bayes factors and prior probabilities is not available to them. Instead, they maintain a firm focus on truth-conduciveness but lower their sights from choosing the true theory to choosing the theory that yields the most accurate predictions. If theory T is deterministic and observation is perfectly reliable and T has no free parameters, then prediction involves deducing what will happen from T . If T has a free parameter θ, then one must use some empirical data to fix the true value of θ, after which one deduces what will happen from T (e.g., two observed points determine the slope and intercept of a linear law). More generally, fixing the parameter values of T results only in a chance distribution Pθ over possible experimental outcomes. In that case, it is natural to use past experimental data E ′ to arrive at an empirical b E ′ ) of parameter θ. A standard estimation technique is to define estimate θ(T, b E ′ ) to be the value of θ that maximizes Pθ (E ′ ). Then θ(T, b E ′ ) is called the θ(T, maximum likelihood estimate or MLE of T (given outcome E ′ ) and the chance distribution Pθ(T,E ′ ) is a guess at the probability of future experimental outcomes b E. The important point is that theory T is not necessarily inferred or believed in this procedure; The aim in choosing T is not to choose the true T but, rather, the ∗ T that maximizes the accuracy of the estimate Pθ(T,E ′ ) of P . Classical statistib cians underscore their non-inferential, instrumentalistic attitude toward statistical theories by calling them models. It may seem obvious that no theory predicts better than the true theory, in which case it would remain mysterious why a fixed bias toward simplicity yields more accurate predictions. However, if the data are random, the true theory is complex, the sample is small, and the above recipe for using a theory for predictive purposes is followed, then a false, overly simplified theory can predict more accurately than the true theory — e.g., even if God were to inform one that the true law is a degree 10 polynomial, one might prefer, on grounds of predictive accuracy, to derive predictions from a linear law. That surprising fact opens the door to an alternative, non-circular explanation of Ockham’s razor in terms of predictive accuracy. The basic idea applies to accuracy in general, not just to accurate prediction. Consider, for example, a marksman shooting at a target. To keep our diagrams as elementary as possible, assume that the marksman is a Flatlander who exists entirely in a two-dimensional plane, so that the target is one-dimensional. There is a wall (line) in front of the marksman and the bull’s eye is a distinguished point θ∗ on that line. Each shot produced by the marksman ˆ so it is natural to define the squared error of shot hits the wall at some point θ, ∗ 2 ˆ ˆ θ as (θ − θ ) (figure 3.a). Then for n shots, the average of the squared errors of the n points is a reflection of the marksman’s accuracy, because the square function keeps all the errors positive, so none of them cancel.7 If one thinks of 7 One could also sum the absolute values of the errors, but the square function is far more commonly used in statistics.
992
Kevin T. Kelly
θ*
Figure 3. Firing range
the marksman’s shots as being governed by a probability distribution reflecting all the stray causes that affect the marksman on a given shot, then one can explicate the marksman’s accuracy as the expected or mean squared error (MSE) of a single shot with respect to distribution P : ˆ θ∗ ) = Exp (θˆ − θ∗ )2 . M SEP (θ, P The MSE is standardly factored in a revealing way into a formula known as the bias-variance trade-off [Wasserman, 2004]: ˆ θ∗ ) = BiasP (θ, ˆ θ∗ )2 + VarP (θ), ˆ MSEP (θ, ˆ θ∗ ) is defined as the deviation of the marksman’s average or exwhere BiasP (θ, pected shot from the bull’s eye θ∗ : ˆ θ∗ ) = Exp (θ) ˆ − θ∗ ; BiasP (θ, P ˆ is defined as the expected distance of a shot from the and the variance Varp (θ) average shot: ˆ 2 ). VarP (θ) = ExpP ((θˆ − ExpP (θ)) Bias is a systematic tendency to hit to a given side of the bull’s eye, whereas variance reflects spread around the marksman’s expected or average shot. Even the best marksman is subject to some variance due to pulse, random gusts of wind, etc., and the variance is amplified systematically as distance from the target increases. In contrast, diligent aim, proper correction of vision, etc. can virtually eliminate bias, so it seems that a marksman worthy of the name should do everything possible to eliminate bias. But that argument is fallacious. Consider the extreme strategy of welding the rifle to a steel post to eliminate variance altogether (figure 3.b). In light of the bias-variance trade-off, the welded rifle is more accurate than honest aiming as long as the squared bias of the welded rifle is less than the variance of the marksman’s unconstrained aim. If variance is sufficiently high (due to distance
from the target, for example), the welded rifle can be more accurate, in the MSE sense, than skillful, unrestricted aim even if the weld guarantees a miss. That is the key insight behind the over-fitting argument.

Welding the rifle to a post is draconian. One can imagine a range of options, from the welded rifle, through various, successively less constraining clamps, to unconstrained aim. For a fixed position θ∗ of the bull's eye, squared bias goes down and variance goes up as aiming becomes less constrained. The minimum MSE (among options available) occurs at a "sweet spot" where the sum of the two curves achieves a minimum. Aiming options that are sub-optimal due to high bias are said to under-aim (the rifle's aim is too constrained) and aiming options that are sub-optimal due to high variance are said to over-aim (the rifle's aim is not constrained enough).

So far, the welded rifle strategy looks like a slam-dunk winner over all competing strategies — just hire an accurate welder to obtain a perfect score! But to keep the contest sporting, the target can be concealed behind a curtain until all the welders complete their work. Now the welded rifles still achieve zero variance or spread, but since bias depends on the bull's eye position θ∗, which might be anywhere, the welding strategy cannot guarantee any bound whatever on bias. The point generalizes to other, less draconian constraints on aim — prior to seeing the target, there is no guarantee how much extra bias such constraints would contribute to the shot. One could lay down a prior probability reflecting where the organizers might have positioned the target, but classical statisticians refuse to consider prior probabilities unless they are grounded in knowledge of objective chance.

Empirical prediction of random quantities is closely analogous to a shooting contest whose target is hidden in advance. The maximum likelihood estimate θ̂(T, E′) is a function of random sample E′ and, hence, has a probability distribution P∗ that is uniquely determined by the true sampling distribution Pθ∗. Thus, θ̂(T, E′) is like a stochastic shot θ̂ at bull's eye θ∗. When the MLE is taken with respect to the completely unconstrained theory T1 = {Pθ : θ ∈ Θ}, it is known in many standard cases that the MLE is unbiased: i.e., BiasP∗(θ̂(T1, E′), θ∗) = 0. Thus, the MLE based on the complex, unconstrained theory is like the marksman's free aim at the bull's eye. How can that be, when the scientist can't see the bull's eye θ∗ she is aiming at? The answer is that nature aims the rifle straight at θ∗; the scientist merely chooses whether the rifle will be welded or not and then records the result of the shot. Similarly, the MLE with respect to constrained theory T0 = {Pθ0} is like shooting with the welded rifle — it has zero variance but no guarantee whatever regarding bias. For a fixed parameter value θ∗ and for theories ordered by increasing complexity, there is a "sweet spot" theory T that maximizes accuracy by optimally trading bias for variance. Using a theory simpler than T reduces accuracy by adding extra bias and is called under-fitting, whereas using a theory more complex or unconstrained than T reduces accuracy by adding variance and is called over-fitting. Note that over-fitting is defined in terms of the bias-variance trade-off, which is relative to sample size, and definitely not in terms of distinguishing genuine trends from mere noise, as some motivational discussions
seem to suggest (e.g., [Forster and Sober, 1994]).

To assume a priori that θ0 is sufficiently close to θ∗ for the MLE based on T0 to be more accurate than the MLE based on T1 is just another way to beg the question in Ockham's favor. But the choice between basing one's MLE on T0 or on T1 is a false dilemma — Ockham's razor says to presume no more complexity than necessary, rather than to presume no complexity at all, so it is up to Ockham's razor to say how much complexity is necessary to accommodate sample E′. To put the same point another way, Ockham's razor is not well-defined in statistical contexts until one specifies a formula that scores theories in a manner that rewards fit but taxes complexity. One such formula is the Akaike [1973] information criterion (AIC), which ranks theories (lower is better) relative to a given sample E′ in terms of the remarkably tidy and suggestive formula:

AIC(T, E′) = badness of fit of T to E′ + complexity of T,

where theoretical complexity is the number of free parameters in T and badness of fit is measured by −ln(Pθ̂(T,E′)(E′)).8 Choosing T so as to minimize the AIC score computed from sample E′ is definitely one way to strike a precise balance between simplicity and fit. The official theory behind AIC is that the AIC score is an unbiased estimate of a quantity whose minimization would minimize MSE [Wasserman, 2004]. That sounds remotely comforting, but it doesn't cut to the chase. Ultimately, what matters is the MSE of the whole strategy of using AIC to choose a model and then computing the MLE of the model so chosen.

To get some feel for the MSE of the AIC strategy itself, it is instructive to return to the firing line. Recall that the MLE based on T0 is like a shot from the welded rifle that always hits point θ0 and the MLE based on T1 is like honest, unbiased aiming at the bull's eye after the curtain rises. Using AIC to decide which strategy to employ has the effect of funneling shots that fall within a fixed distance r from θ0 exactly to θ0 — call r the funnel radius. So on the firing range, AIC could be implemented by making a sturdy funnel of radius r out of battleship plate and mounting it on a firm post in the field so that its spout lines up with the point θ0 (figure 4). The funnel is a welcome sight when the curtain over the target rises and θ0 is seen to line up with the bull's eye θ∗, because all shots caught by the funnel are deflected to more accurate positions. In that case, one would like the funnel to have an infinite radius so as to redirect every shot to the bull's eye (which is decision-theoretically identical to welding the rifle to hit point θ0). The funnel is far less welcome, however, if the intended target is barely obscured by the edge of the funnel, for then accurate shots get deflected or biased away from the bull's eye, with possibly dire results if the target happens to be hostile (fig. 5). In that case, one would prefer the funnel to have radius 0 (i.e., to get rid of it altogether).

8 Recall that the MLE θ̂(T, E′) is the value of free parameter θ in theory T that maximizes Pθ(E′), so Pθ̂(T,E′)(E′) is the best likelihood that can be obtained from T for sample E′. Now recall that −ln drops monotonically from ∞ to 0 over the half-open interval (0, 1].
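As a concrete illustration of the AIC formula (a sketch of mine, with a hypothetical cubic data-generating law and a known unit-variance Gaussian noise model assumed for simplicity), the score of each nested polynomial theory is its negative maximized log-likelihood plus its number of free parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample E': a cubic law observed with unit-variance Gaussian noise.
x = np.linspace(-2, 2, 30)
y = 0.5 * x**3 - x + rng.normal(size=x.size)

def aic(degree):
    # Least-squares fit of the degree-d polynomial theory, which is the
    # MLE under the assumed Gaussian noise model (degree + 1 parameters).
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    # -ln P_theta-hat(E'): negative log-likelihood at the MLE.
    neg_log_lik = 0.5 * np.sum(resid**2) + 0.5 * x.size * np.log(2 * np.pi)
    return neg_log_lik + (degree + 1)   # badness of fit + complexity

scores = {d: aic(d) for d in range(7)}
print(min(scores, key=scores.get))  # typically recovers degree 3
```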
Figure 4. Ockham funnel, best case (θ0 = θ∗: shots caught by the funnel are deflected to the bull's eye, so accuracy is improved)
Figure 5. Ockham funnel, worst case (shots are deflected toward θ0: accuracy improved near θ0 but impaired at the obscured bull's eye θ∗)

More generally, for each funnel radius r from 0 to infinity, one can plot the funnel's MSE over possible bull's eye positions θ∗ in order to portray the methods as decision-theoretic acts with MSE as the loss and θ∗ as the state of the world (fig. 6).9

9 For computer plots of such curves, cf. [Forster, 2001].

How, then, does one choose a funnel radius r? Proponents of AIC sometimes speak of typical or anomalous performance, but that amounts to a tacit appeal to prior probabilities over parameter values, which is out of bounds for classical statisticians when nothing is known a priori about the location of the bull's eye. One prior-free decision rule is to eliminate dominated alternatives, but none of the options in figure 6 is dominated — larger funnels do better as θ∗ approaches θ0 and smaller ones do better as θ∗ diverges from θ0. Another prior-free decision rule is to choose a minimax strategy, i.e., a strategy whose maximum MSE, over all
possible values of θ∗, is minimal over all alternative strategies under consideration.

Figure 6. Ockham funnel decision problem (MSE plotted over bull's eye positions θ∗ for the radii r = ∞, r = r0, and r = 0; a and b mark the worst-case regrets discussed below)

Alas, from figure 6, it is clear that the unique minimax solution among the available options is r = 0, which corresponds to estimation using the most complex theory — hardly a ringing endorsement for Ockham's razor. There is, however, at least one prior-free decision rule that favors a non-extremal funnel radius 0 < r < ∞. The regret of an option at θ∗ is the difference between the MSE of the option at θ∗ and the minimum MSE over all alternative options available at θ∗. The minimax regret option minimizes worst-case regret. As r goes to infinity, the regret a against r = 0 goes up, and as r goes to 0, the regret b against r = ∞ goes up. So there must be a "sweet" value r∗ of r that minimizes a and b jointly and that yields a minimax regret solution. Then r∗ can be viewed as the right balance between simplicity and fit, so far as minimax regret with respect to predictive inaccuracy is concerned. In some applications, it can be shown that AIC is approximately the same as the minimax regret solution when the difference in model complexity is large [Goldenschluger and Greenshtein, 2000]. AIC is just one representative of a broad range of funnel-like techniques motivated by the over-fitting argument, including cross-validation [Hjorth, 1994], Mallows' [1973] statistic, minimum description length [Grünwald, 2007], minimum message length, and structural risk minimization [Vapnik, 1995].

There are, of course, some objections to the over-fitting argument. (1) The argument irrevocably ties Ockham's razor to randomness. Intuitively, however, Ockham's razor has to do with uniformity of nature, conservation laws, symmetry, sequential patterns, and other features of the universe that may be entirely deterministic and discretely observable without serious concerns about measurement error. (2) Over-fitting arguments are sometimes presented vaguely in terms of "minimizing" MSE, without much attention to the awkward decision depicted in figure 6 and the consequent need to invoke either prior probabilities or minimax regret as
a decision rule.10 In particular, figure 6 should make it clear that computer simulations of Ockham strategies at "typical" parameter values should not be taken seriously by classical statisticians, who reject prior probabilistic representations of ignorance. (3) MSE can be challenged as a correct explication of accuracy in some applications. For an extreme example, suppose that an enemy soldier is aiming directly at you. There happens to be a rifle welded to a lamp post that would barely miss your opponent, and another, perfectly good rifle is lying free on the ground. If you value your life, you will pick up the rifle on the ground and aim it earnestly at your opponent, even if you know that the welded rifle has lower MSE with respect to the intended target. For that reason, perhaps, military marksmanship is scored in terms of hits vs. misses on a human silhouette [U.S. Army, 2003] rather than in terms of MSE from a geometrical bull's eye. (4) Finally, the underlying sense of accurate prediction does not extend to predicting the results of novel policies that alter the underlying sampling distribution and, therefore, is too narrow to satisfy even the most pragmatic instrumentalist. That important point is developed in detail in the following section on causal discovery and prediction.

10 Readers familiar with structural risk minimization (SRM) may suspect otherwise, because SRM theory is based on a function b(α, n, c) such that, with worst-case chance 1 − α, the true MSE of using model T of complexity c for predictive purposes is less than b(α, n, c) [Vapnik, 1995]. The SRM rule starts with a fixed value α > 0, sample size n, and a fixed sequence of models of increasing complexity, and then chooses for predictive purposes (at sample size n) the model whose worst-case MSE bound b(α, n, c) is least. Note, however, that the bound is valid only when the model in question is selected and used for predictive purposes a priori. Since b can be expressed as a sum of a measure of badness of fit and a term taxing complexity, SRM is just another version of an Ockham funnel (albeit with a radius larger than that of AIC). Therefore, the MSE of SRM will be higher than that of the theory SRM selects at the "bumps" in MSE depicted in figure 6. So the (short-run) decision theory for SRM ultimately poses the same problems as the decision theory for AIC. In the long run, SRM converges to the true model and AIC does not but, as has already been explained, long-run convergence does not explain Ockham's razor.
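The decision problem of figure 6 can also be explored numerically. In the sketch below (my construction: a single unit-variance Gaussian shot and a hard-threshold funnel at θ0 = 0), Monte Carlo estimates of the MSE curves yield the worst-case regret of each funnel radius against the best option available at each bull's eye position:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n_shots = 0.0, 50_000

def mse(theta_star, r):
    # One shot x ~ N(theta*, 1); shots within r of theta0 are funneled to theta0.
    x = rng.normal(theta_star, 1.0, n_shots)
    est = np.where(np.abs(x - theta0) <= r, theta0, x)
    return np.mean((est - theta_star) ** 2)

radii = [0.0, 0.5, 1.0, 2.0, 4.0]
thetas = np.linspace(0.0, 6.0, 25)
curves = {r: np.array([mse(t, r) for t in thetas]) for r in radii}

# Regret of each radius at each theta*: MSE minus the best MSE available there.
best = np.min(np.vstack(list(curves.values())), axis=0)
for r in radii:
    print(r, np.max(curves[r] - best))   # worst-case regret of radius r
```

In line with the argument above, the extreme options r = 0 and very large r tend to exhibit larger worst-case regret than intermediate radii.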
4
OCKHAM’S CAUSAL RAZOR
Suppose that one employs a model selection technique justified by the over-fitting argument to accurately estimate the incidence of lung cancer from the concentration of nicotine on teeth, and suppose that a strong statistical "link" is found and reported breathlessly in the evening news. Nothing in the logic of over-fitting entails that the estimated correlation would accurately predict the cancer-reducing efficacy of a public tooth-brushing subsidy, for enactment of the policy would change the underlying sampling distribution so as to sever the "link". Getting the underlying causal theory wrong can make even the most accurate predictions about the actual population useless for predicting the counterfactual results of enacting new policies that alter the population.
A possible response is that causal conclusions require controlled, randomized trials, in which case the sample is already taken from the modified distribution and the logic of over-fitting once again applies. But controlled experiments are frequently too expensive or too immoral to perform. Happily, there is an alternative to the traditional dilemma between infeasible experiments and causal skepticism: recent work on causal discovery [Spirtes et al., 2000; Verma and Pearl, 1991] has demonstrated that there is, after all, a sense in which patterns of correlations among several (at least three) variables can yield conclusions about causal orientation. The essential idea is readily grasped. Let X → Y abbreviate the claim that X is a direct cause of Y. Consider the causal situations depicted in figure 7.

Figure 7. Causal situations: the causal chain W → Y → X, the inverted chain X → Y → W, the common cause W ← Y → X, and the common effect W → Y ← X

It is helpful to think of variables as measurements of flows in pipes and of the causal relation X → Y as a pipe with water flowing from flow meter X to flow meter Y [Heise, 1973]. In the causal chain W → Y → X, we have three meters connected by a straight run of pipe, so it is clear that information about one meter's reading would provide some information about the other meter readings. But since W informs about X only in virtue of providing information about Y, knowledge of X provides no further information about W than Y does — in jargon, X is independent of W conditional on Y. By symmetrical logic, the same holds for the inverted chain X → Y → W. The common cause situation W ← Y → X is the same: W provides information about X only in virtue of providing information about the common cause Y, so, conditional on Y, W is independent of X. So far, the situation is pretty grim — all three situations imply the same conditional dependence relations. But now consider the common effect W → Y ← X. In that case, W provides no information about X, since the two variables are causally independent and could be set in any combination. But conditional on Y, the variable X does provide some information about W, because both W and X must collaborate in a specific manner to produce the observed value of Y. Thus, the common effect implies a pattern of dependence and conditional dependence distinct from the pattern shared by the remaining three alternatives. Therefore, common effects and their consequences can be determined from observable conditional dependencies
holding in the data.

There is more. A standard skeptical concern is the possibility that an apparent causal relation W → X is actually produced by a latent or unobserved common cause W ← C → X (just as a puppeteer can make one puppet appear to speak to another). Suppose, for example, that Z is a direct effect of common effect Y. Consider the skeptical alternative in which Y → Z is actually produced by a hidden common cause C of Y and Z (fig. 8).

Figure 8. Confounding hidden cause

But the skeptical alternative leaves
a footprint in the data, since in the confounded situation Z and W are dependent given Y (since W provides some information about C given Y, and C, as a direct cause of Z, provides some information about Z). In the non-confounded situation, the reverse pattern of dependence obtains: W is independent of Z given Y, because Z yields information about W only in virtue of the information Z yields about Y. So it is possible, after all, to obtain non-confoundable causal conclusions from non-experimental data.

Given the true causal theory relating some variables of interest and given an accurate estimate of the free parameters of the theory, one can obtain accurate counterfactual predictions according to a natural rule: to predict the result of intervening on variable X to force it to assume value x, first erase all causal arrows into X, holding other theory parameters fixed at their prior values, and then use the modified theory to predict the value of the variable of interest, say Y. Thus, for example, if X is itself an effect, forcing X to assume a value will break all connections between X and other variables, so the values of other variables will be predicted not to change, whereas if X is a cause, forcing X to assume a value will alter the values of the effects of X. The moral is that accurate counterfactual predictions depend on inferring the causal model corresponding to the true causal relations among the variables of interest — causal hypotheses are not merely a way to constrain noise in actual empirical estimates.

Causal discovery from non-experimental data depends crucially on Ockham's razor in the sense that causal structure is read off of patterns of conditional correlations and there is a bias toward assuming that a conditional correlation is zero. That is a version of Ockham's razor, because non-zero conditional correlations are free parameters that must be estimated in order to arrive at predictions. Absent
any bias toward causal theories with fewer free parameters, one would obtain no non-trivial causal conclusions, since the most complex theory entails a causal connection between each pair of variables, and all such causal networks imply exactly the same patterns of conditional statistical dependence. But since the over-fitting argument does not explain how such a bias conduces to the identification of true causal structure, it fails to justify Ockham's razor in causal discovery from non-experimental data. The following, novel, alternative explanation does.
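Before turning to that alternative explanation, the dependence patterns described above can be verified numerically. The following sketch (a hypothetical linear Gaussian model of my own, not the chapter's) contrasts the chain W → Y → X with the common effect W → Y ← X, estimating conditional dependence by the correlation of regression residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Chain W -> Y -> X.
W = rng.normal(size=n)
Y_chain = W + rng.normal(size=n)
X_chain = Y_chain + rng.normal(size=n)

# Common effect W -> Y <- X.
X_ce = rng.normal(size=n)
Y_ce = W + X_ce + rng.normal(size=n)

def partial_corr(a, b, c):
    # Correlation of a and b after linearly regressing out c from each.
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

# Chain: W and X correlated, but (nearly) independent given Y.
print(np.corrcoef(W, X_chain)[0, 1], partial_corr(W, X_chain, Y_chain))
# Common effect: W and X (nearly) independent, but dependent given Y.
print(np.corrcoef(W, X_ce)[0, 1], partial_corr(W, X_ce, Y_ce))
```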
5
EFFICIENT PURSUIT OF THE TRUTH
To summarize the preceding discussion, the puzzle posed by Ockham’s razor is to explain how a fixed bias toward simplicity is conducive to finding true theories. The crux of the puzzle is to specify a concept of truth-conduciveness according to which Ockham’s razor is more truth-conducive than competing strategies. The trouble with the standard explanations is that the concepts of truth-conduciveness they presuppose are respectively either too weak or too strong to single out Ockham’s razor as the most truth-conducive inferential strategy. Mere convergence to the truth is too weak, since alternative strategies would also converge to the truth. Reliable indication or tracking of the truth in the short run, on the other hand, is so strict that Ockham’s razor can be shown to achieve it only by circular arguments (the Bayes factor argument) or by substituting accurate, non-counterfactual predictions for theoretical truth (over-fitting argument). There is, however, a third option. A natural conception of truth-conduciveness lying between reliable indication of the truth and mere convergence to the truth is effective pursuit of the truth. Effective pursuit is not necessarily direct or even
bounded in time or complexity (e.g., pursuit through a labyrinth of unknown extent). But neither is effective pursuit entirely arbitrary — gratuitous course reversals and cycles should evidently be avoided. Perhaps, then, Ockham's razor is the best possible way to pursue theoretical truth, even though simplicity cannot point at or indicate the true theory in the short run and even though alternative methods would have converged to the truth eventually.

Figure 9. Three concepts of truth-conduciveness: truth indication, truth pursuit, truth convergence
In the pursuit of truth, a course reversal occurs when one retracts or takes back an earlier belief, as when Ptolemaic theory was replaced by Copernican theory. It caused a sensation when Thomas Kuhn [1962] argued that scientific change essentially involves losses or retractions of content and invoked the tremendous cognitive cost of retooling entailed by such changes to explain retention of one's theoretical ideas in the face of anomalies. Emphasis on cognitive retooling may suggest that retractions are a merely "pragmatic" cost, but deeper considerations point to their epistemic relevance. (1) Potential retractions have been invoked in philosophical analyses of the concept of knowledge since ancient times. Plato traced the essential difference between knowledge and true belief to the stability of knowledge in his dialogue Meno, and subsequent authors have expanded upon that theme in attempts to provide indefeasibility accounts of knowledge. For example, suppose that one has good but inconclusive evidence E that Jones owns a Ford when, in fact, only Smith has one, and that one believes, on the basis of E, that either Smith or Jones owns a Ford [Gettier, 1963]. It seems that the inferred belief is not known. Indefeasibility analyses of knowledge (e.g., [Lehrer, 1990]) attempt to explain that judgment in terms of the potential for retracting the disjunctive belief when the grounds for the false belief are retracted. (2) Deductive logic is monotonic, in the sense that additional premises never yield fewer conclusions. Inductive logic is non-monotonic, in the sense that additional premises (new empirical evidence) can undermine conclusions based on earlier evidence. Non-monotonicities are retractions of earlier conclusions, so to minimize retractions as far as finding the truth allows is to approximate deduction as closely as finding the truth allows. (3) In mathematical logic, a formal proof system provides a computable, positive test for theorem-hood — i.e., a Turing machine that halts with "yes" if and only if the given statement is a theorem. The halting condition essentially bounds the power of sound proof systems. But nothing other than convention requires a Turing machine to halt when it produces an answer — like human scientists and mathematicians, a Turing machine can be allowed to output a sequence of revised answers upon receipt of further inputs, in an unending loop. Hilary Putnam [1965] showed that Turing machines that are allowed to retract prior answers at most n + 1 times prior to convergence to the truth can do more than Turing machines that are allowed to retract at most n times. Furthermore, formal verifiability (halting with "yes" if and only if φ is a theorem) is computationally equivalent to finding the right answer with one retraction starting with "no" (say "no" until the verifier halts with "yes" and then retract to "yes"), refutation is computationally equivalent to finding the right answer with one retraction starting with "yes", and formal decidability is computationally equivalent to finding the right answer with no retractions. So retraction bounds are a natural and fundamental generalization of the usual computational concepts of verifiability, refutability, and decidability [Kelly, 2004]. The idea is so natural from a computational viewpoint that theoretical computer scientists interested in inductive inference have developed an elaborate theory of inductive retraction complexity [Case and Smith, 1983; Freivalds and Smith, 1993]. (4)
Finally, and most importantly, the usual reason for distinguishing epistemic from merely pragmatic considerations is that the former are truth-conducive and the latter conduce to some other aim (e.g., wishful thinking is happiness-conducive but not truth-conducive). Retraction-minimization (i.e., optimally direct pursuit of the truth) is part of what it means for an inductive inference procedure to be truth-conducive, so retractions are a properly epistemic consideration. Additional costs of inquiry may be considered in addition to retractions: e.g., the number and severity of erroneous conclusions are a natural epistemic cost, as are the times elapsed until errors and/or retractions are finally avoided. But retractions are crucial for elucidating the elusive, truth-finding advantages of Ockham's razor, for reasons that will become apparent below.

6
EMPIRICAL SIMPLICITY DEFINED
In order to prove anything about Ockham's razor, a precise definition of empirical simplicity is required. The basic approach adopted here is that empirical complexity is a reflection of empirical effects relevant to the theoretical inference problem addressed. Thus, empirical complexity is not a mere matter of notation, but it is relative to the kind of truth one is trying to discover. An empirical effect is just a verifiable proposition — a proposition that might never be known to be false, but that comes to be known, eventually, if it is true. For example, Newton [1726] tested the identity of gravitational and inertial mass by swinging large pendula filled with identical weights of different kinds of matter and then watching to see if they ever went noticeably out of phase. If they were not identical in phase, the accumulating phase difference would have been noticeable eventually. Particle reactions are another example of empirical effects that may be very difficult to produce but that, once observed, are known to occur. Again, two open intervals through which no constant curve passes constitute a first-order effect, three open intervals through which no line passes constitute a second-order effect, and so forth (fig. 10.a-c).11

Figure 10. First, second, and third order effects

11 That is very close to Karl Popper's [1968] discussion of degrees of falsifiability, except that his approach assumed exact data rather than intervals. The difference is crucial to the following argument.

Effects can be arbitrarily small or arbitrarily arcane, so they can take arbitrarily long to notice. Let E be a countable set of possible effects.12 Let the empirical presupposition K be a collection of finite subsets of E. It is assumed that each element of K is a possible candidate for the set of all effects that will ever be observed. The theoretical question Q is a partition of K into sets of finite effect sets. Each partition cell in Q corresponds to an empirical theory that might be true. Let TS denote the (unique) theory in Q that corresponds to finite effect set S in K. For example, the hypotheses of interest to Newton can be identified, respectively, with the absence of an out-of-phase effect or the eventual appearance of an out-of-phase effect. The hypothesis that the degree of an unknown polynomial law is n can similarly be identified with an effect — refutation of all polynomial degrees < n. In light of the above discussion of causal inference, each linear causal network corresponds to a pattern of partial correlation effects (note that conditional dependence is noticeable, whereas independence implies only absence of verification of dependence). Each conservation theory of particle interactions can be identified with a finite set of effects corresponding to the discovery of reactions that are not linearly dependent on known reactions [Schulte, 2000; Luo and Schulte, 2006].13

12 Effects are here assumed to be primitive. A more ambitious and explanatory approach, in which the problem (K, Q) is presented in terms of mere observations and effects are constructed from the topological structure of (K, Q), is developed in [Kelly, 2007; 2008].

13 Ptolemy's theory can be tuned to duplicate Copernican observations for eternity, so the two theories share an effect set. The proposed framework does not apply to that case unless it is assumed that a Ptolemaic universe would not duplicate Copernican appearances for eternity. One reason for ruling out the possibility of an eternally perfect illusion is that no possible method could converge to the truth in such an empirical world, so even optimally truth-conducive methods fail in such worlds. The proposed account focuses, therefore, on empirical adequacy (i.e., consistency with all possible experience), rather than on inaccessible truths transcending all possible experience.

The pair (K, Q) then represents the scientist's theoretical inference problem. The scientist's aim is to infer the true answer to Q from observed effects, assuming that the true effect set is in K. Now empirical simplicity will be defined with respect to inference problem (K, Q). Effect set S conflicts with S′ in Q if and only if TS is distinct from TS′. Let π be a finite sequence of sets in K. Say that π is a skeptical path in (K, Q) if and only if, for each pair S, S′ of successive effect sets along π, effect set S is a subset of S′ and S conflicts with S′ in Q. Define the empirical complexity c(S) of effect set S relative to (K, Q) to be a − 1, where a denotes the length of the
longest skeptical path through (K, Q) that terminates in S.14 Let the empirical complexity c(T) of theory T denote the empirical complexity of the least complex effect set in T. A skeptical path through (K, Q) poses an iterated problem of induction to a would-be solver of problem (K, Q), since every finite sequence of data received from a given state on such a path might have been produced by a state for which some alternative answer to Q is true. That explains why empirical complexity ought to be relevant to the problem of finding the true theory. Problem-solving effectiveness always depends on the intrinsic difficulty of the problem one is trying to solve, and the depth of embedding of the problem of induction determines how hard it is to find the truth by inductive means. Since syntactically defined simplicity (e.g., [Li and Vitanyi, 1993]) can, but need not, latch onto skeptical paths in (K, Q), it does not provide such an explanation. Let e be some input information. Let Se denote the set of all effects verified by e. Define the conditional empirical complexities c(S | e), c(T | e) in (K, Q) just as before, but with respect to the restricted problem (Ke, Q), where Ke denotes the set of all effect sets S in K such that Se is a subset of S.

14 The reason for subtracting 1 is to assign complexity 0 to the simplest states, since each such state S is reached by a path (S) of length 1. There is a maximum precedence path to S because of the assumption that S is finite.

7
INQUIRY AND OCKHAM’S RAZOR
The next step is to provide a precise model of inquiry concerning the problem (K, Q). A stream of experience is an input sequence that presents some finite set (possibly empty) of empirical effects at each stage. Let Sw denote the effect set whose effects are exactly those presented by stream of experience w. An empirical world for K is an infinite stream of experience w such that Sw is an element of K. Let Tw denote TSw. An empirical strategy or method for the scientist is a mapping M from finite streams of experience to theories in Q or to '?', which corresponds to a skeptical refusal to choose any theory at the moment. Let w|i be the initial segment of length i of empirical world w. Method M converges to the truth in problem (K, Q) if and only if, for each empirical world w for K:

lim_{i→∞} M(w|i) = Tw.
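This model of inquiry can be made concrete with a toy instance (mine, not the text's): let effects be natural numbers and let theory Tj say that exactly j effects will ever appear, a linearly ordered problem in the style of polynomial degree inference. The method below always chooses the theory indexed by the number of effects seen so far, which is the "normal Ockham" strategy defined in what follows, and the driver counts its retractions:

```python
from typing import FrozenSet, List, Optional

def normal_ockham(seen_effects: FrozenSet[int]) -> str:
    # Choose the uniquely simplest theory compatible with experience:
    # presume no effects beyond those already seen.
    return f"T{len(seen_effects)}"

def run(stream: List[Optional[int]]) -> int:
    """Feed a stream of observations (None = no new effect this stage)
    and count retractions, i.e., outputs that drop the previous answer."""
    seen, answers = set(), []
    for obs in stream:
        if obs is not None:
            seen.add(obs)
        answers.append(normal_ockham(frozenset(seen)))
    return sum(1 for a, b in zip(answers, answers[1:]) if a != b)

# Nature presents two effects, late; the method retracts once per effect
# and then converges to T2, the true theory of this world.
print(run([None, None, 0, None, 1, None, None]))   # -> 2 retractions
```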
Methodological principles can be viewed as restrictions on possible scientific strategies. For example, strategy M is logically consistent if and only if Se is a subset of M(e), for each finite input sequence e. Strategy M satisfies Ockham's razor if and only if M chooses no theory unless it is the uniquely simplest theory compatible with experience, where simplicity is relative to (K, Q), as described above. As stated, Ockham's razor allows for any number of counter-intuitive vacillations between some theory T and '?'. A natural, companion principle requires that one hang onto one's current theory choice T as long as T remains uniquely
simplest among the theories compatible with experience.15 Call that principle stalwartness. A third principle is eventual informativeness, which says that the method cannot stall with '?' for eternity. A normal Ockham method is a method that satisfies Ockham's razor, stalwartness, and eventual informativeness. The first bit of good news is:

PROPOSITION 1. Normal Ockham methods are logically consistent and converge to the truth.

Proof. Let M be a method that is normally Ockham for (K, Q). Logical consistency follows immediately from Ockham's razor. For convergence, let w be an empirical world for K. Since the effect set Sw presented by w is finite, it follows that only finitely many effect sets in K are simpler than Sw. After some finite stage of inquiry, the finitely many effects in Sw are presented by w, and from that point onward, Sw is the uniquely simplest state compatible with experience. At some later stage along w, method M produces some answer to Q, by eventual informativeness. Ockham's razor implies that the answer produced is TSw. Stalwartness guarantees that TSw is never again dropped along w. ⊣

15 Since theories are linearly ordered by empirical complexity in this introductory sketch, uniqueness is trivial, but the argument can be extended to the non-unique case, with interesting consequences discussed below.

8
A BASIC OCKHAM EFFICIENCY THEOREM
The trouble with proposition 1 is that Ockham's razor is not necessary for mere convergence to the truth: e.g., start out guessing theory T1000 of complexity 1000 without even looking at the data for 1000 stages of inquiry and then switch to a normal Ockham strategy. Efficient convergence rules out all such alternative strategies. Let r(M, w) denote the total number of times along w that M produces an output that does not entail the output produced at the immediately preceding stage (assume that '?' is entailed by every output). If e is a finite stream of experience, define Cj(e) to be the set of all worlds w for K that extend e and that satisfy c(Sw | e) = j. Call Cj(e) the jth empirical complexity set for (K, Q) given e. Define rj(M | e) to be the least upper bound of r(M, w) with respect to all worlds w in complexity set Cj(e) (the least upper bound is ∞ if no finite upper bound exists). Thus, rj(M | e) is the worst-case retraction cost of M given e and given that the actual empirical complexity of the world is exactly j. Next, compare alternative, convergent, logically consistent strategies in terms of worst-case retractions, over all possible world complexities. Let e− denote the result of deleting the last entry in e (if e is the empty sequence, then e− = e). Let M, M′ be two strategies. Say that M is as efficient as M′ given e if and only if rj(M | e) ≤ rj(M′ | e), for each complexity set Cj(e). Say that convergent, logically consistent strategy M is efficient given e if and only if M is as efficient as an arbitrary convergent, logically consistent strategy M′ that agrees with M
along e−. Inefficiency is a weak property — it entails only that M does worse than some convergent, logically consistent competitor over some complexity set Cj(e). A much more objectionable situation obtains when rj(M′ | e) > rj(M | e), for each non-empty Cj(e). In that case, say that M strongly beats M′ given e. Strategy M′ is weakly beaten by M when M does as well as M′ over each non-empty complexity set and better in at least one. Then M′ is strongly (weakly) beaten given e if and only if M′ is strongly (weakly) beaten by some convergent, logically consistent competitor. A strong beating given e implies a weak beating which, in turn, implies inefficiency. Each of those properties is relative to available information e. Say that such a property holds always just in case it holds for each e compatible with K. It is now possible to state the most basic Ockham efficiency theorem:

THEOREM 2. Assume that (i) K is totally ordered by empirical precedence and (ii) each theory is satisfied by a unique effect state. Define efficiency and beating with respect to all convergent, logically consistent methods. Then the following are equivalent:

1. M is always normally Ockham;
2. M is always efficient in terms of retractions;
3. M is never strongly beaten in terms of retractions.

The proof has three straightforward steps.

Step I. Suppose that the scientist always employs some fixed normal Ockham strategy O. Let e, of length i, be the finite sequence of input data received so far. Let r ≤ i be the number of retractions performed by O along e−. Let w be an empirical world in Cj(e). By stalwartness, O retracts at most j times after stage i along w. Thus, rj(O | e) ≤ r + j if O does not retract at i, and rj(O | e) ≤ r + j + 1 otherwise (figure 11).
Figure 11. Sequential retractions of normal Ockham methods (T0 → T1 → T2 → T3 → T4 → · · ·, with one retraction at each step)

Step II. Suppose that the scientist switches at stage i from normal Ockham strategy O to some arbitrary, convergent, logically consistent method M that agrees with O along e−. Suppose that Cj(e) is non-empty, so there exists a skeptical path (S0, . . . , Sj) through (K, Q). Nature can present M with an endless stream of data extending e that presents only effects true in S0 until, on pain of failing to converge to the truth, M converges to TS0. Thus, if O happens to retract at stage i, then M retracts to TS0 no sooner than i, since M(e−) = O(e−). Thereafter,
nature can present just the effects true in S1, followed by no more effects, until, on pain of failing to converge to the truth, M switches to TS1. Iterate that argument until M produces TSj. Since the path is skeptical, it follows that M retracts at least j times after (possibly) retracting to TS0, so:

rj(M | e) ≥ r + j + 1 ≥ rj(O | e) if O retracts at i;
rj(M | e) ≥ r + j + 0 ≥ rj(O | e) otherwise.

So for each convergent, logically consistent M agreeing with O along e− and for each j such that Cj(e) is non-empty, we have that rj(O | e) ≤ rj(M | e). So O is retraction efficient given e. Since e is arbitrary in the preceding argument, O is always retraction efficient.

Step III. Finally, suppose that M violates Ockham's razor at the last entry of input sequence e compatible with K. Since M is logically consistent and effect sets are totally ordered, it follows that M produces a theory T more complex than the simplest theory TS0 compatible with e. Since that is the first Ockham violation by M, we know that M did not also produce T at stage i − 1. Therefore, M retracts at i if O does. Suppose that Cj(e) is non-empty. Let skeptical path (S0, . . . , Sj) witness that fact. Thereafter, as in the preceding paragraph, nature can force M to retract T back to TS0 and can then force another j retractions. Note that O does not perform the (needless) retraction from T back to TS0 (e.g., the retraction from T4 to T2 in figure 12), so:
rj(M | e) ≥ r + j + 2 > r + j + 1 ≥ rj(O | e) if O retracts at i;
rj(M | e) ≥ r + j + 1 > r + j + 0 ≥ rj(O | e) otherwise.

Thus, O strongly beats M at e in terms of retractions. Suppose, next, that M violates stalwartness given e. Then it is immediate that M retracts one extra time in each TSi compatible with e in comparison with O. Method M cannot violate eventual informativeness, since that would imply failure to converge to the truth. ⊣

Figure 12. Ockham violator's extra retraction

Unlike over-fitting explanations, the Ockham efficiency theorem applies to deterministic questions. Unlike the Bayes factor explanation, the Ockham efficiency
theorem does not presuppose a question-begging prior bias in credence toward simple worlds — every world is as important as every other. The crux of any non-circular epistemic argument for Ockham's razor is to explain why leaping to a needlessly complex theory makes one a bad truth-seeker even if that theory happens to be true. To see how the hard case is handled in the Ockham efficiency theorem, note that even if T4 is true in figure 12, leaping straight to T4 when experience refutes T1 provides nature with a strategy to force one through the sequence of theories T4, T2, T3, T4, which not only adds an extra retraction to the optimal sequence T2, T3, T4 but also involves an embarrassing cycle away from T4 and back to T4. In terms of the metaphor of pursuit, it is as if a heat-seeking missile passed its target and had to make a hairpin turn back to it — a performance likely to motivate some re-engineering.

Normal Ockham strategies do not dominate alternative strategies in the sense of having a better outcome in every possibility, since an Ockham violator can be saved from the embarrassment of an extra retraction (and look like a genius) if nature is kind enough to provide the anticipated empirical effects before she loses confidence in her complex theory. Nor are Ockham strategies admissible, in the sense of not being weakly dominated by an alternative method — indeed, every normal Ockham strategy is dominated in error times and retraction times by a strategy that stalls with '?' for a longer time prior to producing an answer. That reflects the special structure of the problem of inductive inquiry — waiting longer to produce an informative answer avoids more possibilities for setbacks, but waiting forever precludes finding the truth at all. Nor are Ockham strategies minimax solutions, in the sense that they minimize worst-case overall cost, since the overall worst-case bound on each of the costs under consideration is infinity for arbitrary, convergent methods. The Ockham efficiency property is essentially a hybrid of admissibility and minimax reasoning. First, one partitions all problem instances according to empirical complexity, and then one compares corresponding worst-case bounds over these complexity classes. The idea is borrowed from the standard practice for judging algorithmic efficiency [Garey and Johnson, 1979]. No interesting algorithm can find the answer for an arbitrarily large input under a finite resource bound, so inputs are routinely sorted by length and worst-case bounds over each size are compared. In the case of empirical inquiry, the inputs (worlds of experience) are all infinite, so length is replaced with empirical complexity.

9
STABILITY, ERRORS AND RETRACTION TIMES
Theorem 2 establishes that, in a specific sense, the normal Ockham path is the straightest path to the truth. But the straightest path is also a narrow path that one might veer from inadvertently. Complex theories have been proposed because no simpler theory had yet been conceived of or because the advantages of a simpler theory were not yet recognized as such (e.g., Newton dismissed the wave theory of light, which was simpler than his particle theory, because he mistakenly thought it could not explain shadows). Theorem 2 does not entail that one should return to
the straightest path, having once departed from it. For example, suppose that at stage i − 1, method M violates Ockham's razor by producing needlessly complex theory T when TS0 is the simplest theory compatible with experience. Let O be just like M prior to i and switch to a normal Ockham method thereafter. Then at stage i, method M saves a retraction compared to O by retaining T — nature can force a retraction back to TS0 — but that is the same retraction O performs at i anyway. So the justification of normal Ockham strategies is unstable in the sense that retraction efficiency does not push an Ockham violator back onto the Ockham path after a violation has already occurred.

The persistent Ockham violator M does incur other costs. For example, M produces more false answers than O from stage i onward over complexity set C0(e), since O produces no false outputs after e− along an arbitrary world in C0(e). Furthermore, both M and O commit unbounded errors, in the worst case over Cj(e), if Cj(e) is non-empty and j > 0. So returning to the normal Ockham fold weakly beats persistent violation, in terms of retractions and errors, at each violation.

It would be better to show that Ockham violators are strongly beaten at each violation. Such an argument can be given in terms of retraction times. The motivation is, again, both pragmatic and epistemic. Pragmatically, it is better to minimize the accumulation of applications of a theory prior to its retraction, even if that theory is true, since retraction occasions a reexamination of all such applications. Epistemically, belief that is retracted in the future does not count as knowledge even if it is true [Gettier, 1963]. It would seem, therefore, that more retractions in the future imply greater distance from knowledge than do fewer such retractions. Hence, in the sole interest of minimizing one's distance from the state of knowledge, one ought to get one's retractions over with as soon as possible.

Considerations of timing occasion the hard question whether a few very late retractions are worse than many early ones. Focus entirely on the easy (Pareto) comparisons in which total cost and lateness both agree. Let (j0, j1, . . . , jr) denote the sequence of times at which M retracts prior to stage i, noting that r also records the total number of retractions. Let σ, τ be such sequences. Say that σ is as bad as τ just in case there is a sub-sequence σ′ of σ whose length is identical to the length of τ and whose successive entries are all at least as great as the corresponding entries in τ. Furthermore, σ is worse than τ if and only if σ is as bad as τ but τ is not as bad as σ. For example, (2, 4, 8) is worse than (3, 7), in light of the sub-sequence (4, 8).

The efficiency argument for O goes pretty much as before. Suppose that M violates Ockham's razor at e of length i. Let O be just like M along e− and switch to a normal Ockham strategy from stage i onward. Let (k0, k1, . . . , kr) be the retraction times of both M and O along e−. Suppose that Cj(e) is non-empty, so there exists a skeptical path (S0, . . . , Sj) through (Ke, Qe). The hard case is the one in which O retracts at i and M does not. Since O is stalwart at i, it follows that T = M(e−) ≠ TS0. Nature can refuse to present new effects until M retracts T in favor of TS0, and can then force an arbitrarily late retraction for each step along the path (S0, . . . , Sj). Method O retracts at most j
times over Cj(e) and retracts once at i in C0(e). Thus:

rj(O | e) ≤ (k0, k1, . . . , kr, i, ∞, . . . , ∞) < (k0, k1, . . . , kr, i + 1, ∞, . . . , ∞) ≤ rj(M | e),

where the final run of ∞ entries in each sequence has length j.
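The "as bad as" comparison on retraction-time sequences can be implemented directly; here is a minimal sketch using greedy matching, which suffices for the existence test:

```python
from typing import Sequence

def as_bad_as(sigma: Sequence[int], tau: Sequence[int]) -> bool:
    """True iff sigma has a sub-sequence of the same length as tau whose
    successive entries are all at least as great as those of tau."""
    i = 0
    for t in tau:
        # Greedily take the earliest remaining entry of sigma that is
        # at least as late as the current entry of tau.
        while i < len(sigma) and sigma[i] < t:
            i += 1
        if i == len(sigma):
            return False
        i += 1
    return True

def worse_than(sigma, tau) -> bool:
    return as_bad_as(sigma, tau) and not as_bad_as(tau, sigma)

# The text's example: (2, 4, 8) is worse than (3, 7), via (4, 8).
print(worse_than((2, 4, 8), (3, 7)))   # True
print(worse_than((3, 7), (2, 4, 8)))   # False
```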
Note that the preceding argument never appeals to logical consistency, which may be dropped. The beating argument for stalwartness violators is easier, since one never saves a retraction by violating stalwartness. Again, violations of eventual informativeness are impossible for convergent methods, so now we have (cf. [Kelly, 2004]):

THEOREM 3. Assume conditions (i) and (ii) of theorem 2. Define efficiency and beating with respect to the set of all convergent methods. Then the following are equivalent:

1. M is normally Ockham from e onward;
2. M is efficient in terms of retraction times and errors from e onward;
3. M is never weakly beaten in terms of retractions and errors from e onward;
4. M is never strongly beaten in terms of retraction times from e onward.

A stronger version of Ockham's razor follows if one charges for expansions of belief or for elapsed time to choosing the true theory, for in that case one should avoid agnosticism and select the simplest theory at the very outset to achieve zero loss in the simplest theory compatible with experience. That conclusion seems too strong, however, confirming the intuition that when belief changes, the epistemically costly part is retracting the old belief rather than adopting the new one. This asymmetry between avoiding retractions as soon as possible and finding truth as soon as possible arises again, in a subtle way, when the Ockham efficiency theorem is extended from theory choice to Bayesian degrees of belief.

10
EXTENSION TO BRANCHING SIMPLICITY
Sometimes, the theories of interest are not ordered sequentially by simplicity, in which case there may be more than one simplest theory compatible with experience. For example, suppose that the question is to find the true form of a polynomial law. For another example, let TS be the theory that the true causal structure is compatible with exactly the partial statistical dependencies in set S. In the inference of linear causal structures with Gaussian error, the branching simplicity structure over models with three variables is exactly the lattice depicted in figure 13 (cf. [Chickering, 2003; Meek, 1995]).
Figure 13. Simplicity for acyclic linear causal models (the lattice of networks over three variables X, Y, Z, grouped into complexity classes C0 through C3)

When there is more than one simplest theory compatible with experience, Ockham's razor seems to demand that one suspend judgment with '?' until nature winnows the field down to a unique theory. That judgment is enforced by efficiency considerations. Suppose that, as in the causal case (figure 13), no maximal skeptical path is longer than another.16 Call that the no short path assumption. Then violating Ockham's razor by choosing one simplest theory over another incurs an extra retraction in every non-empty complexity set, since nature is free to make the other simplest theory appear true, forcing the scientist into an extra retraction. Thereafter, nature can force the usual retractions along a path that visits each non-empty complexity set Cj(e), by the assumption that no path is short.

16 In the case of acyclic linear causal models with independently distributed Gaussian noise, it is a consequence of [Chickering, 2003] that the only way to add a new implied conditional dependence relationship is to add a new causal connection. Hence, each causal network with n causal connections can be extended by adding successive edges, so there are no short paths in that application and the strong argument for Ockham's razor holds.

THEOREM 4 [Kelly, 2007]. Theorem 3 continues to hold if (i) is replaced by the no short path assumption.

Without the no short path assumption, methods that return to the Ockham path are no longer efficient, even in terms of retraction times. Suppose that T0 and T1 are equally simple and that T2 is more complex than T1 but not more complex than T0. Then T0 and T1 both receive empirical complexity degree 0 and T2 is assigned complexity degree 1. Suppose that method M has already violated Ockham's razor by choosing T1 when T0 is still compatible with experience. Alas,
sticking with the Ockham violation beats Ockham's retreating strategy in terms of retractions. For Ockham's retreat counts as a retraction in C0(e). Nature can still lure Ockham to choose T0 and can force a further retraction to T1, for a total of 2 retractions in C1(e). But strategy M retracts just once in C0(e) and once in C1(e). In terms of retraction times, there is a hard choice — the born-again Ockham strategy retracts early in C0(e) and retracts more times in C1(e).

One response to the short path problem is to question whether the short path really couldn't be extended — if all paths are infinite, there are no short paths. Polynomial theories can always be given another term. In causal networks, one can always study another variable that might have a weak connection with variables already studied. A second response is that the simplicity degrees assigned to theories along a short path are arbitrary as long as they preserve order along the path. The proposed definition of simplicity degrees ranks theories along a short complexity path as low as possible, but one might have ranked them as high as possible (e.g., putting T0 in C1(e) rather than in C0(e)), in which case the preceding counterexample no longer holds.17 That option is no longer available, however, if some path is infinite in length and another path is finite in length. The third and, perhaps, best response is to weaken Ockham's razor to allow for the selection of the theory at the root of the longer path. Violating that version of Ockham's razor still results in a strong beating in terms of retraction times, and methods that satisfy it along with stalwartness at every stage are never strongly beaten. The third option becomes all the more compelling below, when it is entertained that some retractions count more than others due to the amount of content retracted.

17 That approach is adopted, for example, in earlier work by [Freivalds and Smith, 1993].
11
WHEN DEFEAT DOES NOT IMPLY REFUTATION
The preceding efficiency theorems all assume that each theory is true of just one effect state. It follows that whenever an Ockham conclusion is defeated by new data, it is also refuted by that data. That is not necessarily the case, as when the question concerns whether polynomial degree is even or odd. A more important example concerns the status of a single causal relation X → Y. Figure 14 presents a sequence of causal theories that nature can force every convergent method to produce. Focus on the causal relation between X and Y. Note that the orientation of the edge flips when the inferred common effect at Y is canceled through discovery of the new causal connection V − Z, and is flipped in the opposite direction by the inference of a common effect at X. The process can be iterated by canceling the new common effect and re-introducing one at Y, etc. So, assuming an unlimited supply of potentially relevant variables, nature can force an arbitrary, convergent method to cycle any number of times between the opposite causal conclusions X → Y and Y → X.18 The causal flips depicted in figure 14 have been elicited (in probability) from the PC causal discovery algorithm [Spirtes et al., 2000] using
approach is adopted, for example, in earlier work by [Freivalds and Smith, 1993]. fact, it can be demonstrated that arbitrarily long causal chains can be flipped in this way.
Simplicity, Truth, and Probability
U W
V X
Y
U W
X
Y
Z V
X
Y
U W
Z V
U W
1013
Z V
X
Y
Z
computer-simulated random samples of increasing size from a fixed causal model.

Figure 14. Causal flipping (a sequence of causal networks over variables U, V, W, X, Y, Z)

18 In fact, it can be demonstrated that arbitrarily long causal chains can be flipped in this way.

Note that one can no longer rely on logical consistency to force retractions of defeated theories, so the beating argument provided for theorem 2 fails when assumption (ii) is dropped. Happily, the beating arguments based on retraction times still work, which is yet another motive for considering retraction times in addition to total retractions.

THEOREM 5 [Kelly, 2006]. Theorems 3 and 4 continue to hold without assumption (ii).
12
EXTENSION TO RANDOMIZED SCIENTIFIC STRATEGIES
The preceding theorems assume that the scientist's method is a deterministic function of the input data. It is frequently the case, however, that randomized or "mixed" strategies achieve lower worst-case losses than deterministic strategies. For example, if the problem is to guess which way a coin lands inside of a black box and the loss is 0 or 1 depending on whether one is right or wrong, guessing randomly achieves a worst-case expected loss bound of 1/2, whereas the lowest worst-case loss bound achieved by either pure (deterministic) strategy is 1. Nonetheless, the Ockham efficiency argument can be extended to show that deterministically stalwart, Ockham strategies are efficient with respect to all convergent mixed scientific strategies, where efficiency is defined in terms of expected retractions, and convergence is understood as convergence in probability, meaning that the objective chance (grounded in the method's internal coin-flipper) that the method produces the true theory goes to one as experience increases [Kelly and Mayo-Wilson, 2010].
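The coin-guessing example can be checked in a few lines; here is a sketch of the standard decision-theoretic computation, with the loss matrix as just described:

```python
import numpy as np

# States: the coin shows heads or tails. A guess costs 0 if right, 1 if wrong.
loss = np.array([[0, 1],    # guess heads: loss in state heads, state tails
                 [1, 0]])   # guess tails

pure_worst = loss.max(axis=1)        # each pure strategy: worst-case loss 1
mix = np.array([0.5, 0.5])           # randomize the guess uniformly
mixed_worst = (mix @ loss).max()     # worst-case expected loss 0.5

print(pure_worst, mixed_worst)       # [1 1] 0.5
```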
THEOREM 6 Kelly and Mayo-Wilson, 2010. All of the preceding theorems extend to random empirical methods when retractions are replaced with expected retractions and retraction times are replaced with expected retraction times.

Here is how it works. A method is said to retract T in chance to degree r at stage k + 1 if the chance that the method produces T goes down by r from k to k + 1. Total retractions in chance are summed over theories and stages of inquiry, so as the chance of producing one theory goes up, the chance of producing the remaining theories goes down. Therefore, nature is in a position to force a convergent method to produce total retractions arbitrarily close to i by presenting an infinite stream of experience w making T true. It is readily shown that the total retractions in chance along w are a lower bound on expected total retractions along w. It is also evident that for deterministic strategies, the total expected retractions are just the total deterministic retractions. So, since deterministic Ockham strategies retract at most i times given that T is true, they are efficient over all mixed strategies as well, and violating either property results in inefficiency.

The extension of the Ockham efficiency theorem to random methods and expected retraction times suggests a further extension to probabilistic theories and evidence (i.e., statistical theoretical inference). It remains an open question to obtain a result exactly analogous to theorem 6 in the case of statistical theory choice. It is no problem to obtain lower bounds on expected retractions and retraction times that agree with those in the proof of theorem 6. The difficulties are on the positive side — to define appropriate analogues of Cn(e), Ockham's razor, and stalwartness that allow for the fact that no statistical hypothesis is ever strictly incompatible with the data.

13 DISJUNCTIVE BELIEFS, RETRACTION DEGREES, AND A GETTIER EXAMPLE
Using '?' to indicate refusal to choose a particular theory is admittedly crude. When there are two simplest theories T1, T2 compatible with the data, it is more realistic to allow retreat to the disjunction T1 ∨ T2 than to a generic refusal to say anything at all — e.g., uncertainty between two equally simple orientations of a single causal arrow does not necessarily require (or even justify) retraction of all the other causal conclusions settled up to that time. Accordingly, method M will now be allowed to produce finite disjunctions of theories in Q. Suppose that there are mutually exclusive and exhaustive theories {Ti : i ≤ n} and let x be a Boolean n-vector. Viewing x as the indicator function of the finite set Sx = {i ≤ n : xi = 1}, one can associate with x the disjunction

  Tx = ⋁_{i∈Sx} Ti.
A retraction now occurs whenever some disjunct is added to one's previous conclusion, regardless of how many disjuncts are also removed. Charging one unit per retraction, regardless of the total content retracted, amounts to the following rule:

  ρret(Tx, Ty) = max_i (yi − xi).
One could also charge one unit for each disjunct added to one's prior output, regardless of how many disjuncts are removed, which corresponds to the slightly modified rule:

  ρdis(Tx, Ty) = Σ_i (yi ∸ xi),

where the cutoff subtraction y ∸ x assumes the value 0 when x ≥ y.19 Assuming no short simplicity paths, charging jointly for the total number of disjuncts added and the times at which the disjuncts are added allows one to derive stronger versions of Ockham's razor and stalwartness from retraction efficiency. The strengthened version of Ockham's razor is that one should never produce a disjunction stronger than the disjunction of all currently simplest theories (disjunctions take the place of '?') and the strengthened version of stalwartness is that one should never disjoin a theory T to one's prior conclusion unless T is among the currently simplest theories.20

When there are short simplicity paths, the Ockham efficiency argument can fail for both of the proposed retraction measures. The counterexample is reminiscent of Gettier's [1963] counterexample to the justified true belief analysis of knowledge (fig. 15). Suppose that T0 is simpler than T1 and T2 and that T2 is simpler than T3. Suppose that experience e is compatible with T0 and that M produces the disjunction (T0 ∨ T2) in response to e "because" M believes T0 on the basis of Ockham's razor and the disjunction follows from T0. If T2 is true, then M has the true belief (T0 ∨ T2) "for the wrong reason" — a Gettier case. Suppose that T0 is refuted. An Ockham method should now retract to (T1 ∨ T2), but M expands to T2 "because" M believed (T0 ∨ T2) and learned that ¬T0. If the truth is T1, then both methods have 1 retraction on either retraction measure and Ockham incurs the retraction earlier, so Ockham (barely) wins in C1(e) after T0 is refuted. But M wins by retracting only once in C2(e), when T3 is true.21 Possible responses to the issue of short simplicity paths include those discussed above in section 10.

19 The same formula takes a finite value for a countable infinity of dimensions as long as each disjunction has at most finitely many disjuncts.
20 It is still the case that nature can force at least n retractions in complexity set Cn and stalwart, Ockham methods retract no more than that. If M violates the strengthened version of Ockham's razor, M produces a disjunction missing some simplest theory T. Nature is now free to force M down a path of increasingly complex theories that begins with T. By the no short paths assumption, this path passes through each complexity set, so M incurs at least one extra retraction in each complexity set. If M violates the strengthened version of stalwartness, then M retracts by adding a complex disjunct T. Nature is free to present experience for a simplest world, forcing M to retract disjunct T.
21 To see why short paths are essential to the example, suppose that there were a theory T4 more complex than T1. Then M would also retract twice in C2 and Ockham would complete the retraction in C1 earlier.
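For concreteness, here is a minimal sketch (Python; the names are illustrative, and both formulas are read with the cutoff subtraction, so removed disjuncts are never credited) of the two retraction charges on Boolean disjunction labels.

```python
def cutoff_sub(y_i, x_i):
    # Cutoff subtraction: positive only when disjunct i is added (x_i = 0, y_i = 1).
    return max(y_i - x_i, 0)

def rho_ret(x, y):
    # One unit per retraction: 1 if any disjunct is added, else 0.
    return max(cutoff_sub(b, a) for a, b in zip(x, y))

def rho_dis(x, y):
    # One unit per added disjunct, however many disjuncts are removed.
    return sum(cutoff_sub(b, a) for a, b in zip(x, y))

# Moving from T0 v T2 (x) to T1 v T2 (y) over theories T0, ..., T3 adds one
# disjunct (T1) and drops one (T0): both measures charge 1.
x, y = (1, 0, 1, 0), (0, 1, 1, 0)
print(rho_ret(x, y), rho_dis(x, y))  # 1 1
```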
Figure 15. Gettier counterexample to Ockham efficiency
14 EXTENSION TO DEGREES OF BELIEF
Bayesian agents may use their degrees of belief to choose among potential theories [Levi, 1983], but they may also regard updated degrees of belief as the ultimate product of scientific inquiry. It is, therefore, of considerable interest to extend the logic of the Ockham efficiency theorems from problems of theory choice to problems of degree of belief assignment. Here are some recent ideas in that direction.

Suppose that the theories under consideration are just T1, T2, T3, in order of increasing complexity. Then each prior probability distribution p over these three theories can be represented uniquely as the ordered triple p = (p(T1), p(T2), p(T3)). The extremal distributions are the basis vectors i1 = (1, 0, 0), i2 = (0, 1, 0), and i3 = (0, 0, 1), and all other coherent distributions lie on the simplex or triangle connecting these points in three-dimensional Euclidean space. A standard argument for distributing degrees of belief as probabilities [de Finetti, 1975; Rosenkrantz, 1983; Joyce, 1998] is that each point x off of the simplex is farther from the true corner of the simplex (whichever it might be) than the point p on the simplex directly below x, so agents who seek immediate proximity to the truth should stay on the surface of the simplex — i.e., be coherent (fig. 16 (a)). It is natural to extend that static argument to the active pursuit of truth in terms of total Euclidean distance traversed on the surface of the simplex prior to convergence to the truth (fig. 16 (b)). As in section 8, nature has a strategy to force each convergent Bayesian arbitrarily close to i1, then arbitrarily close to i2, and then all the way to i3. Each side of the triangular simplex has length √2, so if one adopts √2 as the unit of loss, then nature can force retraction bound k in complexity set Ck(e), just as in the discussion of theory choice. Therefore, the path (p, i2, i3) is efficient, since it achieves that bound. Furthermore, suppose that method M favors complex theory T2 over simpler theory T1 by moving from p to q instead of to i2. Then nature can force M back to i2 by presenting simple data.
Figure 16. Distance from the truth vs. efficient pursuit of the truth

So the detour through q results, in the worst case, in the longer path (p, q, i2, i3) that hardly counts as an efficient pursuit curve (q is passed twice, which amounts to a needless cycle). An ironic objection to the preceding argument is that the conclusion seems too strong — efficiency measured by total distance traveled demands that one start out with full credence in the simplest theory and that one leap immediately and fully to the newly simplest theory when the previous simplest theory is refuted. Avoidance of that strong conclusion was one of the motives for focusing on retractions as opposed to expansions of belief in problems of theory choice, since movement from a state of suspension to a state of belief is not counted as a retraction. Euclidean distance charges equally for expansions and retractions of Bayesian credence, so it is of interest to see whether weaker results can be obtained by charging only for Bayesian retractions. One approach is to define Bayesian retractions as increases in entropy, defined as:

  M(q) = −Σ_i qi log2 qi.
That is wrong, however, since the circuit path (i1, i2, i1) seems to incur two large retractions, but entropy remains constantly 0. A more sophisticated idea is to tally the cumulative increases in entropy along the entire path from p to q, rather than just at the endpoints. But that proposal still allows for "retraction-free" circuits around the entropy peak at the midpoint (1/3, 1/3, 1/3) along a path of constant entropy. The same objection obtains if entropy is replaced with any alternative scalar field that plausibly represents informativeness. Another idea is to measure the retractions from p to q in terms of a popular measure of separation for probability distributions called the Kullback–Leibler (KL) divergence from p to q:

  KL(q|p) = Σ_i qi log2(qi/pi).
KL divergence is commonly applied to measure motions on the simplex in Bayesian experimental design, where the idea is to design the experiment that maximizes the KL divergence from the prior distribution p to the posterior distribution q [Chaloner and Verdinelli, 1995]. It is well known that KL divergence is not a true distance measure or metric because it is asymmetrical and fails to satisfy the triangle inequality. It is interesting but less familiar that the asymmetry amounts to a bias against retractions: e.g., if p = (1/3, 1/3, 1/3) and q = (.999, .0005, .0005) then KL(p|q) ≈ 5.7 and KL(q|p) ≈ 1.6. Unfortunately, KL divergence cannot be used to measure retractions after a theory is refuted because it is undefined (due to taking log(0)) for any motion terminating at the perimeter of the simplex. But even if one approximates such a motion by barely avoiding the perimeter, KL divergence still charges significantly more for hedging one's bets than for leaping directly to the current simplest theory. For example, if p = (.999, .0005, .0005), q = (.0001, .5, .4999), r = (.0005, .9995, .0005), then the KL divergence along path (p, r) is nearly 10.9, whereas the total KL divergence along path (p, q, r) is around 17.7.

Here is a different approach, motivated by a fusion of logic and geometry, that yields Ockham efficiency theorems closely analogous to those in the disjunctive theory choice paradigm.22 The simplex of coherent probability distributions over T0, T1, T2 is just the intersection of the unit cube with a plane through each of the unit vectors (fig. 17). The Boolean vectors labeling vertices of the unit cube are the labels of the possible disjunctions of theories (the origin 0 = (0,0,0) corresponds to the empty disjunction or contradiction). To extend that picture to the entire unit cube, think of Tx as a fuzzy disjunction in which theory Ti occurs to degree xi. Say that Tx is sharp when x is Boolean and say that y is sharp when y is a unit vector. Each vector y in the unit cube can also be viewed as a fuzzy assignment of semantic values to the possible theories. Define the valuation of Tx in y to be the inner product: τy(Tx) = y · x = Σ_i yi · xi. If y and Tx are both sharp, then τy(Tx) is the classical truth value of Tx in y, and if p is a probability and Tx is sharp, then τp(Tx) = p(Tx).23 Entailment is defined by: Tx |= Ty if and only if τz(Tx) ≤ τz(Ty), for each vector z in the unit cube. Thus, Tx |= Ty holds if and only if xi ≤ yi, for each i. The resulting entailment relations are isomorphic to the subset relation over the fuzzy subsets of a 3-element set [Zadeh, 1965]. The fully consistent disjunctions are the fuzzy disjunctions that evaluate to 1 in some sharp assignment. They comprise exactly the upper three faces of the unit cube. The vertices of those faces are the consistent, sharp disjunctions of classical logic. The formulas for retraction measures ρret and ρdis are already defined over the entire unit cube and, hence, may be applied directly to probability assignments.

22 The following definitions and results were developed in collaboration with Hanti Lin.
23 It is tempting, but not necessary for our purposes, to define p(Tx) = p · x for non-sharp Tx as well.
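The asymmetry figures quoted above are easy to reproduce. The following sketch (plain Python; the helper name is illustrative) computes KL divergences in bits for the uniform and nearly dogmatic distributions.

```python
from math import log2

def kl(q, p):
    # KL divergence from p to q in bits: sum over i of q_i * log2(q_i / p_i).
    return sum(q_i * log2(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)

p = (1 / 3, 1 / 3, 1 / 3)
q = (0.999, 0.0005, 0.0005)

# Retracting from near-certainty q back to the uniform p costs far more
# than moving from p to q in the first place.
print(round(kl(p, q), 1))  # 5.7
print(round(kl(q, p), 1))  # 1.6
```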
Figure 17. Simplex and unit cube

That is not the right idea, however, for it is natural to view the move from (0, 1/2, 1/2) to (0, 1, 0) as a pure expansion of credence, but both retraction measures assign retraction 1/2 in this case. As a result, efficiency once again demands that one move immediately to full credence in T1 when T0 is refuted. Here is a closely related idea that works. The grain of truth behind probabilistic indifferentism is that the sharp disjunction T(1,1,0) = T1 ∨ T2 more faithfully summarizes or expresses the uniform distribution (1/2, 1/2, 0) than the biased distribution (1/3, 2/3, 0); a view that can be conceded without insisting, further, that uniform degrees of belief should be adopted. One explanation of the indifferentist intuition is geometrical — the components of p = (1/2, 1/2, 0) are proportional to the components of x = (1, 1, 0) in the sense that there exists a constant c such that x = cp. To be assertible, a proposition should be fully consistent. Tp satisfies the proportionality condition for p but is not fully consistent. Accordingly, say that Tx expresses p just in case Tx is fully consistent and x is proportional to p. Sharp propositions cannot express non-uniform distributions, but fuzzy propositions can: e.g., T(1/2,1,0) expresses (1/3, 2/3, 0) in much the same, natural way that T(1,1,0) expresses (1/2, 1/2, 0).24

24 A disanalogy: τ(1/2,1/2,0)(T(1,1,0)) = 1, but τ(1/3,2/3,0)(T(1/2,1,0)) = 5/6, so the expression of a uniform distribution is also the support of the distribution, but that fails in the non-uniform case.
Each fully consistent disjunction has a unit component, which fixes the constant of proportionality at 1/max_i pi. Thus, the unique propositional expression of p is Tφ(p), where

  φ(p)_i = pi / max_i pi.
Geometrically, φ(p) can be found simply by drawing a ray from 0 through p to the upper surface of the unit cube (fig. 17). One can now define probabilistic retractions as the logical retractions of the corresponding propositional expressions:

  ρret(p, q) = ρret(Tφ(p), Tφ(q));
  ρdis(p, q) = ρdis(Tφ(p), Tφ(q)).
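Here is a minimal sketch (Python; names are illustrative) of these definitions: φ rescales a distribution so that its largest coordinate is 1, and the logical retraction measure is applied to the resulting expressions. It verifies that the move from (0, 1/2, 1/2) to (0, 1, 0), which the raw measure wrongly charges 1/2, incurs no probabilistic retraction.

```python
def phi(p):
    # Propositional expression of p: phi(p)_i = p_i / max_i p_i.
    m = max(p)
    return tuple(p_i / m for p_i in p)

def rho_ret(x, y):
    # Retraction charge on fuzzy labels, using the cutoff subtraction.
    return max(max(b - a, 0.0) for a, b in zip(x, y))

def rho_ret_prob(p, q):
    # Probabilistic retraction: logical retraction of the expressions.
    return rho_ret(phi(p), phi(q))

p, q = (0.0, 0.5, 0.5), (0.0, 1.0, 0.0)
print(rho_ret(p, q))       # 0.5 -- applying the raw measure charges an expansion
print(rho_ret_prob(p, q))  # 0.0 -- expressions (0,1,1) and (0,1,0): no charge
```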
In passing, one can also define Bayesian expansions of belief by permuting p and q on the right-hand sides of the above formulas. Revisions are then the sum of the expansions and retractions. Thus, one can extend the concepts of belief revision theory [Gärdenfors, 1988] to Bayesian degrees of belief — an idea that may have useful applications elsewhere, such as in Bayesian experimental design. Both retraction measures have the natural property that if Ti is the most probable theory under p, then for each alternative theory Tj, the move from p to the conditional distribution p(.|¬Tj) incurs no retractions [Lin, 2009]. Moreover, for purely retractive paths (paths that incur 0 expansions), the disjunctive measure is attractively path-independent: ρdis(p, r) = ρdis(p, q) + ρdis(q, r). Most importantly, both measures entail simplicity biases that fall short of the implausible demand that one must leap to the currently simplest theory immediately (fig. 18). For ρret, the zone of efficient moves from p to the next simplest vertex j when nearby vertex i is refuted is constructed as follows. Let c be the center of the simplex, let i be the vertex nearest to p, let m be the mid-point of the side nearest p and let m′ be the midpoint of the side farthest from p (ties don't matter). Let v be the intersection of line pm′ with line cm. Let o be the intersection of line iv with the side of the simplex farthest from p. Then assuming that credence in the refuted theory drops to 0 immediately, retraction-efficiency countenances moving anywhere on the line segment connecting j and o. For retraction measure ρdis, the construction is the same, except that v is the intersection of cm with pj. Note that when p ≈ i, the Ockham zone for ρret is nearly the entire half-side jm′, whereas measure ρdis allows only for movement directly to the corner j, as is already required in the disjunctive theory choice setting described in section 13. Thus, the extreme version of Ockham's razor is tied to the plausible aim of preserving as much content as possible. In practice, however, an open-minded Bayesian never puts full credence in the currently simplest theory, and in that case the Ockham zone for ρret allows some leeway but is still not liberal enough for Bayesian updating to count as efficient.
Figure 18. Two versions of Ockham's Bayesian razor

In both figures, the result q of updating p with the information that Ti is false can be found by drawing a ray from vertex i to the opposite side of the triangle. Note that q falls within the zone of efficiency for retraction measure ρret but not for measure ρdis. The Gettier-like counterexample presented in section 13 can also arise in 4 dimensions or more for Bayesian agents when the no short path assumption fails (just embed the example into the upper faces of the 4-dimensional unit cube and project it down onto the 3-dimensional simplex contained in that cube). The potential responses reviewed in section 13 apply here as well.
15 CONCLUSION
This study reviewed the major justifications of Ockham's razor in philosophy, statistics, and machine learning, and found that they fail to explain, in a non-circular manner, how Ockham's razor is more conducive to finding true theories than alternative methods would be. The failure of standard approaches to connect simplicity with theoretical truth was traced to the concepts of truth-conduciveness underlying the respective arguments. Reliable indication of the truth is too strong to establish without (a) trading empirical truth for accurate prediction or (b) begging the question by means of a prior bias against complex possibilities. Convergence in the limit is too weak to single out simplicity as the right bias to have in the short run. An intermediate concept of truth-conduciveness is effective pursuit of the truth, where effectiveness is measured in terms of such costs as total retractions and errors prior to convergence. Then one can prove, without circularity or substituting predictive accuracy for theoretical truth, that Ockham's razor is the best possible strategy for finding true theories. That result, called the Ockham efficiency theorem, can be extended to problems with branching paths of simplicity, to problems in which defeated theories are not refuted, to random strategies and, except in some interesting, Gettier-like cases, to Bayesian degrees of belief and to strategies that produce disjunctions of theories. The ultimate goal, which has not yet been reached, is to extend the Ockham efficiency argument to statistical inference.

ACKNOWLEDGEMENTS

The results on random strategies were obtained in close collaboration with Conor Mayo-Wilson, who also provided detailed comments on the draft. The definitions concerning Bayesian retractions were arrived at in close collaboration with Hanti Lin, who also formulated and proved many of the theorems. Unusually detailed comments were provided by Prasanta Bandyopadhyay and by the anonymous reviewer. They were greatly appreciated.

BIBLIOGRAPHY

[Akaike, 1973] H. Akaike. A new look at the statistical model identification, IEEE Transactions on Automatic Control, 19: 716-723, 1973.
[Carnap, 1950] R. Carnap. Logical Foundations of Probability, Chicago: University of Chicago Press, 1950.
[Case and Smith, 1983] J. Case and C. Smith. Comparison of identification criteria for machine inductive inference, Theoretical Computer Science 25: 193-220, 1983.
[Chickering, 2003] D. Chickering. Optimal Structure Identification with Greedy Search, JMLR, 3: 507-554, 2003.
[Domingos, 1999] P. Domingos. The Role of Occam's Razor in Knowledge Discovery, Data Mining and Knowledge Discovery, 3: 409-425, 1999.
[Duda et al., 2001] R. Duda, P. Hart, and D. Stork. Pattern Classification, New York: Wiley, 2001.
[Freivalds and Smith, 1993] R. Freivalds and C. Smith. On the Role of Procrastination in Machine Learning, Information and Computation 107: 237-271, 1993.
[Forster, 2001] M. Forster. The New Science of Simplicity, in Simplicity, Inference, and Modeling, A. Zellner, H. Keuzenkamp, and M. McAleer, eds., Cambridge: Cambridge University Press, 2001.
[Forster and Sober, 1994] M. Forster and E. Sober. How to Tell When Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions, The British Journal for the Philosophy of Science 45: 1-35, 1994.
[Friedman, 1983] M. Friedman. Foundations of Space-time Theories, Princeton: Princeton University Press, 1983.
[Gärdenfors, 1988] P. Gärdenfors. Knowledge in Flux, Cambridge: MIT Press, 1988.
[Garey and Johnson, 1979] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: Wiley, 1979.
[Gettier, 1963] E. Gettier. Is Justified True Belief Knowledge? Analysis 23: 121-123, 1963.
[Glymour, 1980] C. Glymour. Theory and Evidence, Princeton: Princeton University Press, 1980.
[Glymour, 2001] C. Glymour. Instrumental Probability, Monist 84: 284-300, 2001.
[Goldenshluger and Greenschtein, 2001] A. Goldenshluger and E. Greenschtein. Asymptotically minimax regret procedures in regression model selection and the magnitude of the dimension penalty, Annals of Statistics, 28: 1620-1637, 2001.
[Grünwald, 2007] P. Grünwald. The Minimum Description Length Principle, Cambridge: MIT Press, 2007.
[Harman, 1965] G. Harman. The Inference to the Best Explanation, Philosophical Review 74: 88-95, 1965.
[Heise, 1975] D. Heise. Causal Analysis, New York: John Wiley and Sons, 1975.
[Hjorth, 1994] J. Hjorth. Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman and Hall, 1994.
[Jeffreys, 1961] H. Jeffreys. Theory of Probability, 3rd ed., London: Oxford University Press, 1961.
[Joyce, 1998] J. Joyce. A Nonpragmatic Vindication of Probabilism, Philosophy of Science 65: 73-81, 1998.
[Kass and Raftery, 1995] R. Kass and A. Raftery. Bayes Factors, Journal of the American Statistical Association 90: 773-795, 1995.
[Kelly, 1996] K. Kelly. The Logic of Reliable Inquiry, New York: Oxford, 1996.
[Kelly, 2007] K. Kelly. How Simplicity Helps You Find the Truth Without Pointing at it, in V. Harizanov, M. Friend, and N. Goethe, eds., Philosophy of Mathematics and Induction, Dordrecht: Springer, pp. 321-360, 2007.
[Kelly, 2008] K. Kelly. Ockham's Razor, Truth, and Information, in Philosophy of Information, J. van Benthem and P. Adriaans, eds., Dordrecht: Elsevier, pp. 321-360, 2008.
[Kelly and Mayo-Wilson, 2009] K. Kelly and C. Mayo-Wilson. Ockham Efficiency Theorem for Random Empirical Methods, Formal Epistemology Workshop, 2009, http://fitelson.org/few/kelly_mayo-wilson.pdf.
[Kelly and Schulte, 1995] K. Kelly and O. Schulte. The Computable Testability of Theories with Uncomputable Predictions, Erkenntnis 43: 29-66, 1995.
[Kitcher, 1982] P. Kitcher. Explanatory Unification, Philosophy of Science 48: 507-531, 1982.
[Kyburg, 1977] H. Kyburg. Randomness and the Right Reference Class, The Journal of Philosophy, 74: 501-521, 1977.
[Kuhn, 1957] T. Kuhn. The Copernican Revolution, Cambridge: Harvard University Press, 1957.
[Kuhn, 1962] T. Kuhn. The Structure of Scientific Revolutions, Chicago: University of Chicago Press, 1962.
[Lehrer, 1990] K. Lehrer. Theory of Knowledge, Boulder: Westview Press, 1990.
[Levi, 1974] I. Levi. On Indeterminate Probabilities, Journal of Philosophy 71: 397-418, 1974.
[Levi, 1983] I. Levi. The Enterprise of Knowledge: An Essay on Knowledge, Credal Probability, and Chance, Cambridge: MIT Press, 1983.
[Lewis, 1987] D. Lewis. A Subjectivist's Guide to Objective Chance, in Philosophical Papers Volume II, Oxford: Oxford University Press, pp. 83-133, 1987.
[Li and Vitanyi, 1993] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications, New York: Springer, 1993.
[Luo and Schulte, 2006] W. Luo and O. Schulte. Mind change efficient learning, Information and Computation, 204: 989-1011, 2006.
[Mallows, 1973] C. Mallows. Some comments on Cp, Technometrics 15: 661-675, 1973.
[Mayo, 1996] D. G. Mayo. Error and the Growth of Experimental Knowledge, Chicago: The University of Chicago Press, 1996.
[Meek, 1995] C. Meek. Strong completeness and faithfulness in Bayesian networks, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, pp. 411-418, 1995.
[Mitchell, 1997] T. Mitchell. Machine Learning, New York: McGraw Hill, 1997.
[Myrvold, 2003] W. Myrvold. A Bayesian Account of the Virtue of Unification, Philosophy of Science 70: 399-423, 2003.
[Newton, 1726] I. Newton. Philosophiae Naturalis Principia Mathematica, London, 1726.
[Popper, 1968] K. Popper. The Logic of Scientific Discovery, New York: Harper, 1968.
[Putnam, 1965] H. Putnam. Trial and Error Predicates and a Solution to a Problem of Mostowski, Journal of Symbolic Logic 30: 49-57, 1965.
[Rissanen, 2007] J. Rissanen. Information and Complexity in Statistical Modeling, New York: Springer-Verlag, 2007.
[Rosenkrantz, 1983] R. Rosenkrantz. Why Glymour is a Bayesian, in Testing Scientific Theories, Minneapolis: University of Minnesota Press, 1983.
[Salmon, 1967] W. Salmon. The Logic of Scientific Inference, Pittsburgh: University of Pittsburgh Press, 1967.
[Schulte, 1999] O. Schulte. Means-Ends Epistemology, The British Journal for the Philosophy of Science, 50: 1-31, 1999.
[Schulte, 2000] O. Schulte. Inferring Conservation Principles in Particle Physics: A Case Study in the Problem of Induction, The British Journal for the Philosophy of Science, 51: 771-806, 2000.
[Spirtes et al., 2000] P. Spirtes, C. Glymour and R. Scheines. Causation, Prediction, and Search, second edition, Cambridge: MIT Press, 2000.
[Teller, 1976] P. Teller. Conditionalization, Observation, and Change of Preference, in Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, W. Harper and C. Hooker, eds., Dordrecht: D. Reidel, 1976.
[US Army, 2003] U.S. Army. Rifle Marksmanship M16A1, M16A2/3, M16A4, and M4 Carbine, FM 3-22.9, Headquarters, Dept. of the Army, 2003.
[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory, Berlin: Springer, 1995.
[Verma and Pearl, 1991] T. Verma and J. Pearl. Equivalence and Synthesis of Causal Models, Uncertainty in Artificial Intelligence 6: 220-227, 1991.
[Wasserman, 2004] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference, New York: Springer, 2004.
[Wolpert, 1996] D. Wolpert. The lack of a priori distinctions between learning algorithms, Neural Computation, 8: 1341-1390, 1996.
[Zadeh, 1965] L. Zadeh. Fuzzy sets, Information and Control 8: 338-353, 1965.
Part XII
Special Problems in Statistics/Computer Science
NORMAL APPROXIMATIONS
Robert J. Boik
1 INTRODUCTION
It often occurs that an investigator is interested in computing probabilities for a random variable, Y, whose distribution is known only partially. If the mean and variance of the random variable are known, then the investigator could use a normal distribution with the same mean and variance as that of Y to approximate the probabilities of interest. Unfortunately, the magnitude of the approximation error generally is unknown and it can be unacceptably large. If Y is the sum or mean of a sequence of component variables and some regularity conditions are satisfied, however, then the magnitude of the approximation error is bounded and the normal approximation can be quite accurate. The conditions under which the distribution of a sum or a mean of a sequence of component variables approaches a normal distribution are described in so-called "central limit" theorems. This chapter describes the use of the normal distribution to approximate the probabilities of interest.

Section 2 describes several versions of the central limit theorem (CLT). This theorem ensures that, under fairly mild conditions, the distribution of a suitably standardized sum of random variables approaches a normal distribution as the number of random variables that are summed increases. The scope of the CLT is broadened substantially by two additional theoretical results, namely the delta method and Slutsky's theorem. These two results are described in §3. Sections 4–6 describe various applications of the CLT. In practice, the primary use of the CLT is to approximate the sampling distribution of a test statistic or the sampling distribution of an asymptotically pivotal quantity. It is frequentists rather than Bayesians who use sampling distributions to make inferences about parameter values. Accordingly, the applications in sections 4–6 are frequentist in nature. For example, a frequentist who is interested in the value of a population mean, µ, might obtain a random sample, Y1, Y2, ..., Yn, from the population and use the CLT to approximate the sampling distribution of the sample mean Ȳ. The approximate sampling distribution of Ȳ then could be used to construct a confidence interval for µ. A Bayesian who is interested in a population mean, µ, would adopt a different approach. The Bayesian would treat µ as a random variable, rather than as a fixed quantity. Inference about µ would be based on the distribution of µ conditional on the data, Y1, Y2, ..., Yn. This conditional distribution is called the posterior distribution. Conditional distributions are defined in Appendix A.1. Section 7 describes a Bayesian CLT for posterior distributions.
The last section describes higher-order expansions that improve on the accuracy of frequentist and Bayesian CLTs.

2 THE CENTRAL LIMIT THEOREM
The normal distribution with mean µ and variance σ² is denoted by N(µ, σ²). Its probability density function (pdf) is

  (1)  ϕ(x; µ, σ²) = e^{−(x−µ)²/(2σ²)} / (σ√(2π))  for x ∈ (−∞, ∞).
A random variable Z with distribution N(0, 1) is said to have the standard normal distribution. The pdf of the standard normal distribution is ϕ(z; 0, 1) and, for convenience, this pdf is denoted simply as ϕ(z). The normal distribution also is called the "Gaussian" distribution in honor of K. F. Gauss, even though Gauss was not the first to use this distribution. Nonetheless, it was Gauss [1809] who discovered the intimate connections between the normal distribution and the method of least squares. The recognition that many naturally occurring variables are approximately normally distributed is the reason that Peirce [1873], Galton [1877; 1895], and Lexis [1877] called the distribution "normal" in the first place (see [David, 1995; Kruskal and Stigler, 1997]). The classical CLT usually is attributed to Laplace [1810], although substantial related work had been done before 1810. A description of this earlier work can be found in Adams [1974] and Le Cam [1986]. The CLT was named the central limit theorem by Polya [1920] because of its central role in probability and statistics. The basic form of the theorem is given as Theorem 1. A proof is outlined in Appendix B.1. A square is used to mark the end of a theorem.

THEOREM 1 Laplace Central Limit Theorem. Suppose that Y1, Y2, ..., Y∞ is a sequence of independent and identically distributed (iid) random variables each with mean µ and variance σ² < ∞. Define Ȳn, Zn, and F_{Zn}(z) as

  Ȳn def= (1/n) Σ_{i=1}^n Yi,  Zn def= √n(Ȳn − µ)/σ,  and  F_{Zn}(z) def= P(Zn ≤ z),

where z is any finite constant. Then,

  lim_{n→∞} F_{Zn}(z) = Φ(z),  where  Φ(z) = ∫_{−∞}^z ϕ(u) du,

ϕ(u) = ϕ(u; 0, 1) is defined in (1), and Φ(z) is the cumulative distribution function (cdf) of the N(0, 1) distribution. The conclusion of the theorem also can be written as

  √n(Ȳn − µ)/σ →dist N(0, 1)  or as  (Tn − nµ)/(σ√n) →dist N(0, 1), as n → ∞,
where Tn = Σ_{i=1}^n Yi and the relationship operator →dist denotes convergence in distribution or law. Convergence in distribution is defined in Appendix A.2. See [Ferguson, 1996, part 1; Sen and Singer, 1993, ch. 2] for additional details about modes of convergence. In practice, Theorem 1 is applied by using N(µ, σ²/n) as the approximating distribution for Ȳn or using N(nµ, nσ²) as the approximating distribution for Tn.

One point of confusion for beginning statistics students is the distinction between the law of large numbers and the central limit theorem. The weak law of large numbers states that under fairly general conditions Ȳ →prob µ as n → ∞, where the relationship operator →prob denotes convergence in probability. Informally, Ȳ →prob µ means that the distribution of Ȳ becomes more and more concentrated near µ as n → ∞. In the limit, the entire distribution is concentrated at one point, namely µ. Convergence in probability is formally defined in Appendix A.2. Students also are taught that the limiting distribution of Ȳ, as n → ∞, is a normal distribution. Some students (usually the better ones) become confused because these two claims are contradictory. The first says that Ȳ approaches a constant, whereas the second says that the distribution of Ȳ approaches a normal distribution and, therefore, Ȳ does not approach a constant. This contradiction is easily resolved by more carefully stating the CLT claim. The CLT claim is that the limiting distribution of √n(Ȳ − µ)/σ is a normal distribution, namely N(0, 1). The CLT does not claim that the limiting distribution of Ȳ is normal. Nonetheless, the CLT does imply that if n is finite and large, then the distribution of Ȳ can be accurately approximated by N(µ, σ²/n). The limit of this distribution as n → ∞, however, is not a normal distribution. Specifically, lim_{n→∞} N(µ, σ²/n) is a distribution that has all of its mass on one point, namely µ. That is,

  (2)  lim_{n→∞} P(Ȳn ≤ t) = { 0 if t < µ;  1 if t > µ. }

Equation (2) is merely a restatement of the law of large numbers. Accordingly, the limiting distribution of Ȳ is not a normal distribution.

Less restrictive forms of the CLT than that in Theorem 1 apply to independent but not identically distributed random variables (see Theorem 4 in this chapter). The CLT also has been extended to non-independent random variables and vectors. If a sample of size n is randomly selected from a finite population without replacement, then the sample Y1, Y2, ..., Yn consists of non-independent random variables. This situation is discussed in §5.1 in this chapter. See [Lehmann, 1999, §2.8] for an introduction to other extensions of the CLT to non-independent random variables. These basic and less restrictive theorems are important because many random variables of practical importance (e.g., log likelihood functions) can be represented as the sum of component random variables. Also, in observational studies and experiments, the response variable sometimes can be represented as the sum of more elementary variables. For example, the weight of an organism can be represented as the sum of the weights of the components of the organism. If these component weights satisfy certain moment conditions (i.e., existence of bounded moments and restrictions on covariances among component weights), then a version of the CLT will apply.

A special case of Theorem 1 pre-dates Laplace's work by nearly 80 years. In the context of gambling problems, De Moivre [1733] found that if Tn ∼ Binomial(n, θ), then

  (Tn − nθ)/√(nθ(1 − θ)) →dist N(0, 1).

This result can be obtained from Theorem 1 by recognizing that Tn has the same distribution as the sum of n identically and independently distributed Bernoulli random variables, each with mean θ and variance θ(1 − θ).
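The De Moivre special case is easy to check by simulation. A minimal sketch (Python with NumPy, which is an assumption of this illustration and not part of the chapter) compares empirical probabilities of the standardized binomial count with the standard normal cdf.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(0)
n, theta, reps = 200, 0.2, 100_000
t = rng.binomial(n, theta, size=reps)
z = (t - n * theta) / np.sqrt(n * theta * (1 - theta))

for c in (-1.0, 0.0, 1.0):
    # Empirical P(Z_n <= c) should be close to Phi(c).
    print(round((z <= c).mean(), 3), round(Phi(c), 3))
```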
In practice, sample sizes are finite rather than infinite. The accuracy of the normal approximation when sample size is finite was addressed by Berry [1941] and Esseen [1945]. They verified that the magnitude of the error induced by the normal approximation is bounded by a constant of order O(n^{−1/2}). Order of magnitude notation O, o, Op and op is defined in Appendix A.2. See [Bishop et al., 1975, ch. 14] for additional details about order of magnitude. The result of Berry and Esseen is the following.

THEOREM 2 Berry-Esseen. If the conditions of Theorem 1 hold and the third absolute moment, η3 = E|Yi − µ|³, is finite, then

  sup_z |F_{Zn}(z) − Φ(z; 0, 1)| ≤ c η3 / (σ³√n),

where c is a constant.

The generalization to independent but not necessarily identically distributed random p-vectors (Theorem 4) requires a Lindeberg-type condition: if, for every ε > 0,

  (1/n) Σ_{i=1}^n E[ Qni I(√Qni ≥ ε√n) ] → 0 as n → ∞,  where  I(√Qni ≥ ε√n) = { 1 if √Qni ≥ ε√n;  0 otherwise, }

then

  √n(Ȳn − µn) →dist N(0, Σ).

The condition on E[Qni I(√Qni ≥ ε√n)] ensures that, for any fixed p-vector t, each individual variance, Var(t′Yni) = t′Σni t for i = 1, ..., n, is small compared to their sum, Σ_{i=1}^n Var(t′Yni) (see Feller, 1971, p. 264).
3 THE DELTA METHOD AND SLUTSKY'S THEOREM
In statistical practice, one often seeks the asymptotic distribution of a function of sample means, rather than of the means themselves.

Example. Suppose that the conditions of Theorem 1 are satisfied and, in addition, the third and fourth moments of Yi are finite. The first two sample moments are Ȳn and Sn², where Ȳn is defined in Theorem 1 and

  Sn² def= (1/(n − 1)) Σ_{i=1}^n (Yi − Ȳn)².
It is readily shown that E(Ȳn) = µ and that E(Sn²) = σ². Suppose that it is of interest to find the asymptotic distribution of

  (3)  Un = (Un1, Un2)′ = (√n(Ȳn − µ), √n(Sn² − σ²))′.

To find this distribution, a first step is to express (Ȳn, Sn²)′ as a function of sample means. Define Wi, W̄n, and θ as

  (4)  Wi = (Yi, Yi²)′,  W̄n def= (1/n) Σ_{i=1}^n Wi = (W̄n1, W̄n2)′,  and  θ def= E(Wi) = (µ, µ² + σ²)′.

It can be shown that

  n Var(W̄n) = [ σ²           κ3 + 2µσ²
                κ3 + 2µσ²    κ4 + 2σ⁴ + 4µ(µσ² + κ3) ],
where κj is the jth cumulant of the distribution of Yi. Cumulants and cumulant generating functions are described in Appendix A.5. It follows from Theorem 3 that

  √n(W̄n − θ) →dist N(0, Σ),  where Σ def= n Var(W̄n).

The random vector Un in (3) can be written in terms of W̄n as follows:

  Un = √n { (W̄n1, [n/(n − 1)](W̄n2 − W̄n1²))′ − (µ, σ²)′ }.

To find the asymptotic distribution of Un, it is necessary to find the asymptotic distribution of a differentiable function of W̄n. To accomplish this task, the delta method is useful. Theorem 5 summarizes the delta method. See [Sen and Singer, 1993, pp. 136–137] for an accessible proof.

THEOREM 5 Delta Method. Suppose that {Tn}∞_{n=1} is a sequence of p-variate random vectors with asymptotic distribution √n(Tn − θ) →dist N(0, Σ) as n → ∞, where θ is a fixed p-vector of parameters and Σ is a p × p covariance matrix. Further, suppose that g(Tn) is a k × 1 function whose first derivative exists for Tn in an open neighborhood of θ. Denote the k × p matrix of first partial derivatives of g(Tn) evaluated at Tn = θ by G. That is,

  G def= ∂g(Tn)/∂Tn′ |_{Tn=θ} = ∂g(θ)/∂θ′ = [ ∂g(θ)/∂θ1  ∂g(θ)/∂θ2  ···  ∂g(θ)/∂θp ].

If 0 < ||G|| < ∞, then

  √n [g(Tn) − g(θ)] →dist N(0, GΣG′).

If GΣG′ is nonsingular, then the conclusion of Theorem 5 also can be written as

  (GΣG′)^{−1/2} √n [g(Tn) − g(θ)] →dist N(0, Ik).
In practice, the values of quantities such as GΣG′ are unknown and must be estimated. If G is a continuous function of θ, then it can be shown that

  (5)  (ĜΣ̂Ĝ′)^{−1/2} √n [g(Tn) − g(θ)] →dist N(0, Ik),

where Ĝ = G(Tn) and Σ̂ is a consistent estimator of Σ. Slutsky's theorem (Slutsky, 1925) is useful to verify (5). Proofs of Slutsky's theorem can be found in Ferguson [1993, ch. 6] and Sen and Singer [1993, pp. 127–131].
THEOREM 6 Slutsky. Let {Yn}∞_{n=1} be a sequence of random p-vectors and let {Xn}∞_{n=1} be a sequence of random q × r matrices. Also, let Y be a random p-vector and let C be a q × r matrix of constants. If Yn →dist Y, Xn →prob C, and h(Xn, Yn) is a continuous function of Xn and Yn (except, possibly, on sets of measure zero), then

  h(Xn, Yn) →dist h(C, Y).

Special cases that follow from Slutsky's Theorem include the following.

(a) If Xn has dimension p × 1, then Xn + Yn →dist C + Y.
(b) If Xn has dimension q × p, then XnYn →dist CY.
(c) If Xn has dimension 1 × r, then YnXn →dist YC.

The importance of Slutsky's Theorem is that it justifies replacing unknown parameters by consistent estimators of those parameters without affecting the asymptotic distribution of a random quantity. For example, let Z be a random
k-vector with distribution N(0, GΣG′). Then the claim in (5) follows from Slutsky's Theorem because

  (ĜΣ̂Ĝ′)^{−1/2} →prob (GΣG′)^{−1/2}  and  √n [g(Tn) − g(θ)] →dist Z
    ⟹ (ĜΣ̂Ĝ′)^{−1/2} √n [g(Tn) − g(θ)] →dist (GΣG′)^{−1/2} Z,

and (GΣG′)^{−1/2} Z ∼ N(0, Ik).
Example Revisited. As an illustration of Theorems 5 and 6, re-consider the problem of finding the joint asymptotic distribution of Ȳn and Sn². Let Tn = (Tn1, Tn2)′ = W̄n, where W̄n is defined in (4). Define gn(Tn) and g(Tn) as

  gn(Tn) def= (Tn1, [n/(n − 1)](Tn2 − Tn1²))′  and  g(Tn) def= (Tn1, Tn2 − Tn1²)′.

Note that Un in (3) can be written as

  Un = √n(Ȳn − µ, Sn² − σ²)′ = √n [gn(Tn) − g(θ)] = √n [g(Tn) − g(θ)] + Xn,
    where Xn = [√n/(n − 1)] (0, Tn2 − Tn1²)′,

and θ is defined in (4). Then, it follows from Theorem 6 and from Xn →prob 0 that Un and √n [g(Tn) − g(θ)] have the same asymptotic distribution. Accordingly, it follows from Theorems 5 and 6 that

  √n(Ȳn − µ, Sn² − σ²)′ →dist N(0, Ω)  and  Q = n(Ȳn − µ, Sn² − σ²) Ω^{−1} (Ȳn − µ, Sn² − σ²)′ →dist χ²₂,

where

  Ω = GΣG′ = [ σ²   κ3
               κ3   κ4 + 2σ⁴ ],  and  G = ∂g(θ)/∂θ′ = [ 1     0
                                                        −2µ   1 ].

Note that Ȳn and Sn² are asymptotically independent only if the skewness coefficient, κ3/σ³, is zero. Furthermore, it follows from Theorem 6 that Ω^{−1} can be replaced by Ω̂^{−1} without affecting the asymptotic distribution of Q, where Ω̂ is any consistent estimator of Ω.
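The limiting covariance matrix Ω can be checked by simulation. In the sketch below (NumPy assumed; the setup is illustrative and not from the chapter), data are drawn from a standard exponential distribution, for which µ = σ² = 1, κ3 = 2, and κ4 = 6, so that Ω = [[1, 2], [2, 8]].

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 4000

# Standard exponential: mu = 1, sigma^2 = 1, kappa_3 = 2, kappa_4 = 6.
y = rng.exponential(1.0, size=(reps, n))
u = np.sqrt(n) * np.column_stack([y.mean(axis=1) - 1.0,
                                  y.var(axis=1, ddof=1) - 1.0])

print(np.cov(u.T).round(2))  # should be close to [[1, 2], [2, 8]]
```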
4 DISCRETE DISTRIBUTIONS
In this section, normal approximations to discrete distributions are examined. Details are given for the binomial and negative binomial distributions.
4.1 Normal Approximation to the Binomial Distribution
First, consider a sequence of independent and identically distributed Bernoulli random variables, each with success probability θ. Denote the Bernoulli random variables as Y1, Y2, ..., Yn. The value of Yi is 1 if the ith trial is a success and 0 otherwise. Let Tn = Σ_{i=1}^n Yi. The random variable Tn is the total number of successes in n trials. It follows that Tn has a binomial distribution with parameters n and θ. The probability mass function (pmf) for Tn is

  f_{Tn}(t) = P(Tn = t) = \binom{n}{t} θ^t (1 − θ)^{n−t},  for t = 0, 1, ..., n.

It is readily shown that µ def= E(Yi) = θ and σ² def= Var(Yi) = θ(1 − θ). It follows from the CLT (Theorem 1) that

  (Tn − nθ)/√(nθ(1 − θ)) →dist N(0, 1).
= P (Tn = t) = P (t − 0.5 < Tn ≤ t + 0.5) = P (T n ≤ t + 0.5) − P (Tn ≤ t − 0.5) 1 t+0.5−nµ t−0.5−nµ √ √ =Φ −Φ + O n− 2 , nσ 2 nσ 2
where Φ is the cdf of the standard normal distribution. The supremum of the error in the normal approximations to the cdf and to the pmf are listed in Table 1. The last column of Table 1 illustrates that the constant, c, in the Berry-Esseen Theorem (Theorem 2) is less than 0.7975 for this example. The normal approximation to a binomial random variable often is used to make frequentist inferences about the value of the parameter, θ. Suppose that Tn ∼ Binomial(n, θ) for fixed n and unknown θ. It follows from Theorems 1 and 2 that # " |T − nθ| − 0.5 n P p ≤ zα/2 = 1 − α + O n−1/2 , nθ(1 − θ)
where zα is the 100(1−α) percentile of N(0, 1). A confidence interval with nominal confidence coefficient 1 − α can be obtained by inverting the inequality. The result
1036
Robert J. Boik
n = 10
n=5 0.5 0.45
0.3
0.4
0.25
0.35 0.3
f(T ) n
0.2
0.25
0.15
0.2 0.15
0.1
0.1
0.05
0.05 0 −2
−1
0
1
2
3
4
5
6
0 −2
0
2
4
6
8
n = 50
n = 20 0.25
0.2
0.1 0.15
f(Tn) 0.1
0.05 0.05
0 −2
0
2
4
Tn
6
8
10
12
0 0
5
10
Tn
15
Figure 1. Normal Approximation to the Binomial Distribution
Table 1. Error in Approximating the PMF and CDF

                              Supremum Error      Actual
  Distribution       n        PMF       CDF       c
  Binomial           5        0.0864    0.2373    0.3344
                     10       0.0400    0.1778    0.3540
                     20       0.0172    0.1296    0.3643
                     50       0.0074    0.0836    0.3707
  Negative           5        0.0202    0.0800    0.1005
  Binomial           10       0.0050    0.0565    0.1058
                     20       0.0023    0.0399    0.1087
                     50       0.0009    0.0252    0.1105
is

  P(L_{n,α} ≤ θ ≤ U_{n,α}) = 1 − α + O(n^{−1/2}),  where

  L_{n,α} = [ 2Tn* − 1 + z²_{α/2} − z_{α/2} √( z²_{α/2} + 4(Tn* − 0.5) − 4(Tn* − 0.5)²/n ) ] / [ 2(n + z²_{α/2}) ],

  U_{n,α} = [ 2Tn** + 1 + z²_{α/2} + z_{α/2} √( z²_{α/2} + 4(Tn** + 0.5) − 4(Tn** + 0.5)²/n ) ] / [ 2(n + z²_{α/2}) ],

  Tn* = { 1 if Tn = 0;  Tn otherwise },  and  Tn** = { n − 1 if Tn = n;  Tn otherwise }.
Note that (Ln,α , Un,α ) is a frequentist confidence interval; the endpoints are random variables whereas the parameter θ is a fixed, but unknown constant. For example, if n = 20 and Tn = 7 is observed, then the nominal 95% confidence interval is (0.16, 0.59). This interval is correct if θ is captured by the interval and is incorrect otherwise. The actual confidence coefficient that corresponds to the nominal 95% intervals can be obtained by computing the interval for each value of Tn from 0 to n and then summing the probability of Tn over those intervals that capture θ. For example, if θ = 0.2, then the actual confidence coefficient (i.e., coverage) for nominal 95% intervals is 0.993, 0.967, 0.990, and 0.980 for sample sizes n = 5, n = 10, n = 20, and n = 50, respectively.
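The interval is straightforward to compute. A minimal sketch (plain Python; the function name is illustrative and not from the chapter) reproduces the worked example above.

```python
from math import sqrt

def binom_ci(t, n, z=1.96):
    # Inverted normal-approximation interval for theta, as displayed above.
    t_lo = 1 if t == 0 else t          # T_n*  (lower endpoint)
    t_hi = n - 1 if t == n else t      # T_n** (upper endpoint)
    def endpoint(tt, sign):
        half = tt + 0.5 * sign         # T* - 0.5 or T** + 0.5
        root = sqrt(z ** 2 + 4 * half - 4 * half ** 2 / n)
        return (2 * tt + sign + z ** 2 + sign * z * root) / (2 * (n + z ** 2))
    return endpoint(t_lo, -1), endpoint(t_hi, +1)

print(binom_ci(7, 20))  # approximately (0.16, 0.59), as in the text
```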
4.2 Normal Approximation to the Negative Binomial Distribution
For the second application, consider an infinite sequence of independent and identically distributed Bernoulli random variables, each with success probability θ. Again, denote the Bernoulli random variables as Y1, Y2, ..., Y∞. The value of Yi is 1 if the ith trial is a success and 0 otherwise. Let W be the number of failures before the first success (i.e., Yi = 1) occurs in the sequence. The random variable W is said to have a geometric distribution and its pmf is

  P(W = w) = (1 − θ)^w θ  for w = 0, 1, ..., ∞.

Now consider a sequence of independent and identically distributed geometric random variables, W1, W2, ..., Wn. Denote the sum of the geometric random variables by Tn. The random variable Tn has the same distribution as the number of failures before the nth success occurs in a sequence of independent and identically distributed Bernoulli trials. The random variable Tn is said to have a negative binomial distribution and its pmf is

  f_{Tn}(t) = P(Tn = t) = \binom{t + n − 1}{t} θ^n (1 − θ)^t,  for t = 0, 1, ..., ∞.
Figure 2. Normal Approximation to the Negative Binomial Distribution, θ = 0.2

It is readily shown that µ def= E(Wi) = (1 − θ)/θ and σ² def= Var(Wi) = (1 − θ)/θ². Accordingly, it follows from Theorem 1 that

  [Tn − n(1 − θ)/θ] / √(n(1 − θ)/θ²) →dist N(0, 1).
The exact pmf and the normal approximations to the pmf of Tn for θ = 0.2 and n = 5, 10, 20, and 50 are displayed in Figure 2. Table 1 displays the approximation errors. The normal approximation to the pmf was computed using the continuity correction illustrated in (6). The normal approximation to a negative binomial random variable can be used to make frequentist inferences about the value of the parameter, θ. Suppose that Tn ∼ Negative Binomial(n, θ) for fixed n and unknown θ. It follows from Theorems 1 and 2 that

  P[ (|Tn − n(1 − θ)/θ| − 0.5) / √(n(1 − θ)/θ²) ≤ z_{α/2} ] = 1 − α + O(n^{−1/2}),
where zα is the 100(1−α) percentile of N(0, 1). A confidence interval with nominal confidence coefficient 1 − α can be obtained by inverting the inequality. The result
is

  P(L_{n,α} ≤ θ ≤ U_{n,α}) = 1 − α + O(n^{−1/2}),  where

  L_{n,α} = [ n(Tn + 0.5 + n − z²_{α/2}/2) − z_{α/2} √( n(Tn + 0.5)(Tn + 0.5 + n) + n²z²_{α/2}/4 ) ] / (Tn + 0.5 + n)²,

  U_{n,α} = [ n(Tn* − 0.5 + n − z²_{α/2}/2) + z_{α/2} √( n(Tn* − 0.5)(Tn* − 0.5 + n) + n²z²_{α/2}/4 ) ] / (Tn* − 0.5 + n)²,

  and  Tn* = { 1 if Tn = 0;  Tn otherwise }.
For example, if n = 20 and Tn = 46 is observed, then the approximate 95% confidence interval is (0.18, 0.41). This interval is correct when θ = 0.2 because θ is captured by the interval. The actual confidence coefficient that corresponds to the nominal 95% intervals can be obtained by computing the interval for each value of Tn from 0 to ∞ and then summing the probability of Tn over those intervals that capture θ. For example, if θ = 0.2, then the actual confidence coefficient (i.e., coverage) for nominal 95% intervals is 0.962, 0.960, 0.954, and 0.953 for sample sizes n = 5, n = 10, n = 20, and n = 50, respectively.
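As with the binomial case, the endpoints can be computed directly; the sketch below (plain Python; names are illustrative) reproduces the worked example.

```python
from math import sqrt

def negbin_ci(t, n, z=1.96):
    # Inverted normal-approximation interval for theta, as displayed above.
    t_star = 1 if t == 0 else t  # T_n*, used in the upper endpoint
    def endpoint(a, sign):
        # a is T_n + 0.5 for the lower endpoint, T_n* - 0.5 for the upper one.
        num = n * (a + n - z ** 2 / 2) \
            + sign * z * sqrt(n * a * (a + n) + n ** 2 * z ** 2 / 4)
        return num / (a + n) ** 2
    return endpoint(t + 0.5, -1), endpoint(t_star - 0.5, +1)

print(negbin_ci(46, 20))  # approximately (0.18, 0.41), as in the text
```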
5 RANDOMIZATION INFERENCE
In many circumstances, frequentist statistical inference can be justified by the act of randomization (i.e., random selection and/or random assignment). In this section, normal approximations are applied to two areas in which randomization justifies statistical inference, namely estimating the mean of a finite population (random selection) and testing hypotheses about equality of population distributions (random assignment). These applications are illustrated in §5.1 and §5.2, respectively.
5.1 Inference about the Mean of a Finite Population
The central limit theorems in §2 require that {Yi}∞_{i=1} be independently distributed random variables. This requirement is satisfied whenever units are randomly selected from an infinite population or are randomly selected with replacement from a finite population. If n units are selected at random without replacement from a finite population of size N ≥ n, however, then {Yi}ⁿ_{i=1} are not independently distributed.

Example. Suppose that a sample of size n is selected at random and without replacement from a population containing N units. The mean and variance of the population are

  µ def= E(Yi) = (1/N) Σ_{i=1}^N Yi  and  σ² def= Var(Yi) = (1/N) Σ_{i=1}^N (Yi − µ)².
Denote the sample mean based on a sample of size n by Ȳn. Then ȲN = µ with probability 1 and this implies that Var(ȲN) = 0. It is readily shown that for any sample size, the joint distribution of Y1, Y2, ..., Yn is exchangeable. Exchangeability is defined in Appendix A.3. Exchangeability implies that E(Yi) = E(Y1) for all i and that Cov(Yi, Yj) = Cov(Y1, Y2) for all i ≠ j. See Berry and Lindgren [1996, §2.6] for additional details. Accordingly,

  0 = Var(ȲN) = (1/N²) [ Σ_{i=1}^N Var(Yi) + Σ_{i≠j} Cov(Yi, Yj) ] = (1/N²) [ Nσ² + N(N − 1) Cov(Y1, Y2) ],
which implies that Cov(Yi, Yj) = −σ²/(N − 1) for all i ≠ j. These nonzero covariances imply that {Yi}ⁿ_{i=1} are not mutually independent. Fortunately, the central limit theorem can be extended to the case of sampling without replacement from a finite population (David, 1938; Madow, 1948; Erdős and Rényi, 1959; Hájek, 1960). Let P1, P2, ..., P∞ be a sequence of populations such that the Nth population consists of units with values Y_{N1}, Y_{N2}, ..., Y_{NN}. The mean and the rth central moment of P_N are

  µ_N = (1/N) Σ_{j=1}^N Y_{Nj}  and  µ_{Nr} = (1/N) Σ_{j=1}^N (Y_{Nj} − µ_N)^r.
Note that µ_{N2} = σ²_N, the variance of P_N. It is assumed that σ²_N < ∞ and that higher-order central moments of P_N are bounded as N → ∞. Specifically,

  (7)  µ_{Nr}/σ^r_N = O(1) for r = 3, 4, ... as N → ∞.
Let Y_{N1}, Y_{N2}, ..., Y_{Nn} be a random sample of size n selected from P_N without replacement. Consider the linear function Tn = Σ_{j=1}^n c_{Nj} Y_{Nj}, where the coefficients {c_{Nj}}ⁿ_{j=1} are fixed numbers that are not all zero. It is readily shown that

  E(Tn) = n c̄n µ_N  and  Var(Tn) = n (V²_{cn} + c̄n²) σ²_N [N/(N − 1)] [ 1 − n c̄n² / (N(V²_{cn} + c̄n²)) ],

where

  c̄n = (1/n) Σ_{j=1}^n c_{Nj},  and  V²_{cn} = (1/n) Σ_{j=1}^n (c_{Nj} − c̄n)².
Following Madow [1948], it is assumed that the coefficients satisfy

  (8)  (a)  [ (1/n) Σ_{j=1}^n c^r_{Nj} ] / (V²_{cn} + c̄n²)^{r/2} = O(1) for r = 3, 4, ...,  and  (b)  n c̄n² / [N(V²_{cn} + c̄n²)] < 1 − ε,
as n → ∞ and N → ∞, where ε is a positive constant that does not depend on either n or N. Condition (a) ensures that no single coefficient dominates the rest and condition (b) ensures that Var(Tn) > 0. Under these conditions, the CLT in Theorem 1 can be extended to finite populations. The extension is summarized in Theorem 7.

THEOREM 7 Finite Population Central Limit Theorem. If (7) and (8) are satisfied then

  [Tn − E(Tn)] / √Var(Tn) →dist N(0, 1)

as n → ∞ and N → ∞.
It is important to note that Theorem 7 requires that both N and n increase. It is not sufficient for only n to increase, regardless of how large N might be (see Plane and Gordon, 1982). To apply Theorem 7 to the sample mean, equate each coefficient c_{Nj} to 1/√n. Note that n/N < 1 as n → ∞ and N → ∞ is sufficient to satisfy Madow's conditions in (8) for this choice of c_{Nj}. Also, c_{Nj} = 1/√n implies that Tn = √n Ȳn, E(Tn) = √n µ_N, and Var(Tn) = σ²_N [1 − (n − 1)/(N − 1)]. It follows from Theorem 7 that

  √n(Ȳn − µ_N) / {σ²_N [1 − (n − 1)/(N − 1)]}^{1/2} →dist N(0, 1).

As an illustration, consider the population of size N = 25 displayed in Figure 3. The population values were generated by sampling from a Poisson distribution with mean 4. The mean and variance of the finite population are µ_N = 4.16 and σ²_N = 4.21. The sampling distribution of Ȳn based on all samples of size n = 5 selected without replacement is displayed in Figure 4. The \binom{25}{5} = 53,130 samples yield only 29 distinct values of Ȳn, namely 1.4 to 7.0 in steps of size 0.2. A normal distribution with mean E(Ȳn) = µ_N = 4.16 and variance Var(Ȳn) = (σ²_N/n) [1 − (n − 1)/(N − 1)] = 0.70 is superimposed on the line graph. In this example, the supremum absolute cdf approximation error is 0.048 and the maximum absolute pmf error is 0.002.

Using Cov(Yi, Yj) = −σ²_N/(N − 1), it is readily shown that

  (S²/n) (1 − n/N)  is unbiased for  Var(Ȳn) = (σ²_N/n) [1 − (n − 1)/(N − 1)],

where S² = (n − 1)^{−1} Σ_{j=1}^n (Y_{Nj} − Ȳn)² is the sample variance. Accordingly, the endpoints of two-sided nominal 100(1 − α)% confidence intervals, based on Theorem 7, can be constructed as

  Ȳn ± t_{1−α/2,n−1} [ (S²/n) (1 − n/N) ]^{1/2},

where t_{1−α/2,n−1} is the 100(1 − α/2) percentile of the t distribution with n − 1 degrees of freedom. For example, if a sample of size n = 5 is randomly selected
Figure 3. Finite Population, N = 25
0.5 0.45 0.4
Density
0.35 0.3
0.25 0.2
0.15 0.1 0.05 0 0
2
4 y¯
6
8
Figure 4. Distribution of Y , Sampling from a Finite Population
Normal Approximations
1043
2 and the sample is (2, 3, 3, 6, 9), p then y¯ = 4.6, s = 8.3, and the nominal 95% confidence interval is 4.6 ± 2.776 8.3(1 − .2)/5 = (1.40, 7.80). This specific interval is correct because it does capture the population mean, µ = 4.16. For the population in Figure 3 and samples of size n = 5, the coverage of confidence intervals can be found by computing the proportion of the 25 = 53,130 confidence intervals that 5 capture the population mean. For two-sided intervals with nominal confidence coefficients 90%, 95%, and 99%, the coverage is 0.916, 0.958, and 0.995, respectively.
5.2
Permutation Tests
Fisher [1936; 1937] introduced a class of tests that can be used when the form of the population distribution is unknown. The tests are based on the distribution of the sample conditional on a minimal sufficient statistic. Informally, a minimal sufficient statistic is a function of the sample that provides the greatest data reduction while still preserving all information about the unknown parameters that is contained in the sample. More precise definitions of sufficient statistics and minimal sufficient statistics are given in Appendix A.4. Suppose that Y1 , Y2 , . . ., YN is a random sample from a continuous distribution and that nothing is known about which or what type of continuous distribution the data have been sampled from. In this case, it can be shown that the vector of order statistics T = (Y(1) Y(2) . . . Y(N ) )′ is minimal sufficient, where Y(i) is the ith smallest value in the sample. The joint distribution of Y1 , Y2 , . . . , YN conditional on the order statistics is 1 = N! 0
P (Y1 = y1 , Y2 = y2 , . . . , YN = yN |T = t) if y1 , y2 , . . . , yN is a permutation of t1 , t2 , . . . , tN , otherwise.
That is, each possible rearrangement (permutation) of the sample is equally likely and has probability 1/N ! regardless of which continuous distribution the data have been sampled from. Fisher suggested that this conditional distribution could be used to construct distribution-free hypothesis tests. In this section, a normal approximation to Fisher’s permutation test in a one-way analysis of variance is illustrated. Suppose that N experimental units are randomly assigned to k treatments such that Pk nj units are assigned to treatment j and N = j=1 nj . Denote the responses of the nj units under treatment j by Yij , for i = 1, . . . , nj and denote their common cdf by Fj (y). The form of Fj is unknown, but it is assumed that its mean (µj ) and variance are finite and that the treatment effects, if any, merely shift the distribution up or down on the number line. Accordingly, Fj (y − µj ) = F0 (y − µj ) for j = 1, . . . , k, where F0 is a cdf. The linear model that corresponds to this setting is
1044
Robert J. Boik
(9) Yij = µj + ǫij , for i = 1, . . . , nj and j = 1, . . . , k, where {ǫij } are iid with mean zero, variance σ 2 , and cdf F0 . It is of interest to test the null hypothesis H0 : µ1 = µ2 = · · · = µk . If the k populations are normally distributed, then under H0 , the conventional F statistic has an F distribution with degrees of freedom k − 1 and N − k, where SS Treat /(k − 1) , F = SSE /(N − k) nj 1 X Yij , Yj = nj i=1
N=
k X
nj ,
SS Treat =
k X j=1
j=1
nj (Y j − Y )2 ,
nj k X k nj X 1 XX (Yij − Y j )2 . Yij , and SSE = Y = N j=1 i=1 j=1 i=1
Alternatively, the F statistic can be transformed to a statistic, B that has a Beta distribution with parameters (k − 1)/2 and (N − k)/2, where B=
SS Treat (k − 1)F = and (k − 1)F + N − k SS Total
SS Total = SS Treat + SSE =
nj k X X j=1 i=1
(Yij − Y )2 .
The null hypothesis is rejected for large values of F or, equivalently, for large values of B. Fisher’s permutation test is conducted conditional on the order statistics for the total sample of size N . The conditional sampling distribution of F or B Qkis constructed by computing the value of the test statistic under each of the N !/ j=1 nj ! possible assignments of data to groups. If the null hypothesis is true, then, conditional on the order statistics, each assignment sample values to the k treatments Qof k is equally likely and has probability 1/[N !/ j=1 nj !]. Accordingly, an exact test can be constructed by comparing the observed value of F or B to its conditional sampling distribution. If sample sizes are not small, then construction of the exact conditional sampling distribution may not be feasible. For example, if k = 4 and n1 = n2 = n3 = n4 = 10, then there are 40!/[(10!)4 ] ≈ 4.71 × 1021 possible assignments of data to groups. Because sample sizes are equal across groups, only (1/4!)40!/[(10!)4 ] ≈ 1.96 × 1020 assignments of data to groups need be examined. Nonetheless, computation of the test statistic (F or B) under each of these assignments is a formidable task, even with a high speed computer. In this case, using the CLT to approximate the conditional sampling distribution can save time without inducing appreciable loss in accuracy. This permutation test (exact or normal approximation) must be used with caution, however, even though the form of the distribution, F0 , is arbitrary. Some investigators have employed the permutation procedure to test the hypothesis of equality of means when model (9) is not satisfied. In particular, the permutation test is sometimes used to test equality of population means when population
Normal Approximations
1045
variances differ among the k sub-populations. Boik [1987] showed that the permutation test is not a robust alternative to the conventional F test in this situation. Denote the arithmetic mean of the k sample sizes by n and denote that harmonic mean of the k sample sizes by n e. That is, −1 k k X 1X 1 . n= nj and n e = k k j=1 n j j=1
If population variances are heterogeneous and n e/n is large (i.e., near 1), then the type I error rate for the permutation test is larger than the type I error rate of the normal-theory F test. If population variances are heterogeneous and n e/n is small, then the type I error rate for the permutation test is smaller than the type I error rate of the normal-theory F test. In either case, use of the permutation test rather than the normal theory F test does not ensure better control over type I error rate. Central limit theorems for linear statistics under permutation were obtained by Wald and Wolfowitz [1944], Hoffding [1951] and others. The finite population version of the CLT that was obtained by Madow [1948] also applies to permutation distributions. In the present application, the finite sample version of the CLT ensures that the joint permutation distribution of the scaled sample means converges in distribution to a multivariate normal distribution.
THEOREM 8 Permutation Central Limit Theorem. Denote the conditional variance of the {Yij } values by SN2 . That is SN2 = SS Total /N . Assume that (a) the higher-order central moments of {Yij } are bounded as in (7) and (b) nj /N → λ2j as N → ∞, where λ2j ∈ (0, 1) for j = 1, . . . , k. Denote the k-vector with components λ1 , λ2 , . . . , λk by λ. Define ZN j and ZN as √ ′ nj Y j − Y ZN j = for j = 1, . . . , k, and ZN = ZN 1 ZN 2 · · · ZN k . SN
Then, as N → ∞, the distribution of ZN , conditional on the order statistics T, converges to the singular multivariate normal distribution with mean 0 and covariance matrix Σ = Ik − λλ′ . Note that Σ is idempotent with rank k − 1. It follows from [Christensen, 1996, Theorem 1.3.6] and Theorem 6 in this chapter that conditionally and unconditionally SS Treat dist = N B = Z′N ZN −→ χ2k−1 SS Total /N as N → ∞.
Theorem 8 reveals that the asymptotic permutation distributions of N B and (k − 1)F are identical to their asymptotic distributions when sampling from a normal distribution. In practice, however, the asymptotic χ2 distribution is rarely employed. A more accurate approximation to the permutation distribution of B
1046
Robert J. Boik
can be obtained by matching the exact permutation moments of B to those of a beta distribution. The first two permutation moments of B are µB = (k − 1)/(N − 1), where R =
6 κ b4 + N +1
N (N + 1)
and κ b4 =
σB2 = Pk
j=1
2(N − k)(k − 1) + R, (N 2 − 1)(N + 1)
2 n−1 j − 2(N − k)(k − 1) + (N + 1)k
N (N − 1)(N − 2)(N − 3)
k nj 1 XX (Yij − Y )4 N j=1 i=1
(SS Total /N )2
,
− 3.
The quantity κ b4 is the kurtosis of the {Yij } values. See [Pitman, 1937; Box and Anderson, 1955; Robinson, 1983] for details on computing permutation moments. Matching the conditional moments of B to those of a Beta(α, β) distribution yields k−1 µB (1 − µB ) −1 = 1 + O N −1 and α = µB 2 σ 2 B µB (1 − µB ) N −k β = (1 − µB ) −1 = 1 + O N −1 . 2 σB 2
Under normality, the exact values of α and β are (k − 1)/2 and (N − k)/2 respectively. For example, Table 2 displays a stem and leaf plot of N = 15 values sampled from the χ23 distribution and then rounded to 2 decimal places. Suppose that k = 3 and sample sizes are n1 = 4, n2 = 5, and n3 = 6. The fitted beta parameters are α = 0.958 and β = 5.748. The exact normal-theory values are 1 and 6, respectively. Figure 5 displays the difference between the exact permutation cdf of B and the beta(0.958, 5.748) cdf approximation. The exact permutation cdf is based on all 630,630 possible assignments of data to groups. The absolute difference in Figure 5 is no larger than 0.006, except when b is near zero. If the beta(0.958, 5.748) distribution is used as the reference distribution for testing H0 , then the actual test sizes that correspond to nominal test sizes of 0.10, 0.05, and 0.01 are 0.0982, 0.0506, and 0.0116, respectively. As a specific example, suppose that the n = 15 values in Table 2 are randomly assigned to k = 3 groups and the results in Table 3 are observed. For this data set, SS Total = 40.516, SS Treat = 1.8905, and b = 0.0467. The beta approximation to the p-value is P (B ≥ 0.0467) = 0.742, where B ∼ beta(0.958, 5.748). The exact p-value (conditional on the order statistics) computed from the permutation distribution is P (B ≥ 0.0467|T) = 0.746. 6
LIKELIHOOD-BASED INFERENCE
Suppose that Yi for i = 1, . . . , n are mutually independent observable random p-vectors, where the pdf or pmf for Yi is fi (yi ; θ) and θ is a k-vector of unknown
Normal Approximations
1047
Table 2. Stem and Leaf Plot of Sample of size N = 15 from χ23 . Stem unit is ones, leaf unit is hundredths Stems 0 1 2 3 4 5 6
Leaves 69 16 03 11 00 45 46
19 23 51 19
28 60 95 56
0.006 0.004
FB (b) − FˆB (b)
0.002 0
−0.002 −0.004 −0.006 0
0.2
0.4
b
0.6
0.8
1
Figure 5. Normal Approximation to the Permutation Sampling Distribution of B
Table 3. Data for k = 3 Groups Group 1 Group 2 Group 3 3.95 2.60 4.56 4.00 2.23 2.03 1.19 1.16 1.28 5.45 4.19 3.11 3.51 6.46 0.69 Mean 3.65 2.74 3.02
1048
Robert J. Boik
parameters. The function fi may depend on known covariates. Denote the parameter space for θ by Θ. The likelihood function of θ given the data is defined as the joint pdf or pmf of {Yi }ni=1 given θ. In the present case, the Yi vectors are independent so the likelihood and log likelihood functions are (10) L(θ; y) =
n Y
def
fi (yi ; θ) and ℓ(θ; y) = ln[L(θ; y)] =
n X
ln [fi (yi ; θ)] ,
i=1
i=1
respectively, where y = {yi }ni=1 and yi is a realization of Yi . Note that L(θ; y) is a function of the observed values y, whereas L(θ; Y) is a function of the random variables Y = {Yi }ni=1 . The method of maximum likelihood estimation was introduced by Fisher [1912; 1922; 1925]. A maximum likelihood estimate (mle) of θ is the value of θ that maximizes the likelihood function L(θ; y). The maximum likelihood estimator (MLE) is a random variable and is the maximizer of L(θ; Y). Both the mle and the b provided that the context provides a clear distinction. MLE can be denoted by θ, Many properties of the MLE have been established, including consistency [Doob, 1934; Cram´er, 1946; Wald, 1949], asymptotic normality [Doob, 1936; Cram´er, 1946; Wald, 1943; Le Cam, 1970], and efficiency [Daniels, 1961]. Theorem 9 gives the asymptotic distribution of the MLE under the assumption that {Yi }ni=1 are mutually independent but not necessarily identically distributed. The regularity conditions listed in Appendix C.1 are sufficient to ensure validity of the theorem. A heuristic proof is given in Appendix B.3. Details and complete proofs were given by Bradley and Gart [1962] and Hoadley [1971]. The theorem can be extended to the case where {Yi }ni=1 are not independently distributed [Lehmann and Casella, 1998, §6.7]. THEOREM 9 Asymptotic Normality of MLEs. If the regularity conditions in Appendix C.1 are satisfied, then −1 √ dist b − θ) −→ n(θ N 0, Iθ,∞
b is a where Iθ,∞ is the limit of the average information matrix as n → ∞, θ consistent root of the likelihood equations S(θ; Y) = 0 and S(θ; Y) is the score function. The score function and the average limiting information matrix are defined as def
S(θ; Y) =
n
∂ ℓ(θ; Y) X Si (θ; Yi ), and Iθ,∞ = lim n−1 Iθ,n , where = n→∞ ∂θ i=1 def
Si (θ; Yi ) =
∂ ln [fi (Yi ; θ)] def and Iθ,n = E [S(θ; Y) S(θ; Y)′ ] . ∂θ
In practice, Theorem 9 is employed by using the approximation b∼ θ ˙ N θ, I−1 θ,n .
Normal Approximations
1049
For example, consider a dose-response study in which the outcome is binary. Denote the response of the ith experimental unit by Yi , where Yi = 1 if the drug is effective (i.e., success) and Yi = 0 otherwise. Denote the probability of success for the ith case by τi ; i.e., P (Yi = 1) = τi . Suppose that τi satisfies the logit function: ln[τi /(1 − τi )] = x′i θ, where xi is a k × 1 vector of known covariates and θ is a k × 1 vector of unknown parameter values. These assumptions yield the familiar logistic regression model for Yi for i = 1, . . . , n. It follows that ′
′ , 1 + exi θ n X [Yi ln(τi ) + (1 − Yi ) ln(1 − τi )] , ℓ(θ; Y) =
τi = L(θ; Y) =
n Y
i=1
τiYi (1 − τi )1−Yi ,
S(θ; Y) =
n X i=1
exi θ
i=1
xi (Yi − τi ) , and Iθ,n =
n X i=1
xi τi (1 − τi )x′i .
As an illustration, suppose that k = 2, θ = (−1 0.2)′ , and xi = (1 i)′ with frequency ni = 10 for i = 1, . . . , 5. For this model, the inverse information matrix is 0.492 −0.132 b ≈ I−1 = . Var(θ) θ,n −0.132 0.043 To examine the accuracy of the normal approximation, 100,000 data sets that followed the logistic regression P5 model were randomly generated. The sample size for each data set was n = i=1 ni = 50. The simulation-based means and varib are ances of θ b = −1.058 compared to θ = −1.0 and b θ) E( 0.2 0.213 0.492 −0.132 0.565 −0.152 b = d θ) . compared to I−1 = Var( θ,n −0.132 0.043 −0.152 0.049
It is apparent that the MLEs are slightly biased and that I−1 θ,n slightly underestib mates the covariance matrix of θ. b the asymptotic princiTo capture the bivariate nature of the distribution of θ, b pal components of θ can be examined. Principal components are successive linear functions that have maximum variability subject to being uncorrelated with previous linear functions and subject to a scale constraint on the coefficients of the linear functions. These linear functions are obtained by conducting an eigen-analysis on the asymptotic covariance matrix. The two principal components are PCA1 0.965 −0.263 b θ with mean and variance PCA = = 0.263 0.965 PCA2 0.528 0.000 −1.017 . and Var(PCA) = E(PCA) = 0.000 0.007 −0.070
1050
Robert J. Boik
"True" Density Normal Approximation
0.6
4
0.4
Density
Density
0.5
0.3
3 2
0.2
1
0.1 0 −4
"True" Density Normal Approximation
5
−3
−2 −1 0 First Principal Component
1
2
0 −0.4
−0.3
−0.2 −0.1 0 0.1 Second Principal Component
0.2
0.3
Figure 6. Normal Approximation to the Distribution of MLEs b (using The simulation-based means and variances of the principal components of θ asymptotic coefficients) are 0.607 0.000 −1.076 d b . , and Var(PCA) = E(PCA) = 0.000 0.008 −0.073
Figure 6 displays the “true” and the normal approximations to the pdf of each of the two principal components. The “true” pdfs were estimated by applying a kernel density estimator to the simulation results. With minor exceptions, the normal approximations are reasonably accurate. 7
BAYESIAN POSTERIOR DISTRIBUTIONS
Let θ be a k vector of parameters with parameter space θ ∈ Θ. Suppose that, prior to observing data from the study at hand, the investigator can summarize his/her beliefs about θ as the probability density function h(θ). Let Yi for i = 1, . . . , n be independent random p-vectors with pdf or pmf fi (yi ; θ), where fi may depend on known covariates. The likelihood and log likelihood of θ given the data are given in (10). The distribution of θ conditional on Y = y is called the posterior distribution of θ. Using Bayes Theorem, the posterior distribution can be expressed as follows: h(θ; y) = Z
Θ
h(θ) exp {ℓn (θ; y)}
.
h(θ) exp {ℓn (θ; y)} dθ
In practice, the posterior density can be difficult to evaluate and, therefore, approximations can be useful. It was recognized as early as 1774 [Laplace, 1774] that if θ is suitably scaled and mild regularity conditions are satisfied, then its asymptotic posterior distribution is normal. Proofs of the asymptotic normality of
Normal Approximations
1051
posterior distributions under various conditions were obtained by Walker [1969], Dawid [1970], Heyde and Johnstone [1979], Chen [1985]. and Schervish [1995, §7.4.2]. Extensions to semiparametric and nonparametric likelihoods were given by Shen [2002]. ’indexinvariant prior The asymptotic normal result applied to independent, but not necessarily identically distributed random variables is summarized in Theorem 10. A heuristic proof is sketched in Appendix B.4. The proof is based on an expansion of the log b where θ b is the mle of θ. likelihood function around θ = θ,
THEOREM 10 Bayesian Central Limit Theorem. Under the regularity conditions in Appendix C.2, the asymptotic posterior distribution of θ conditional on the data is the following: √ dist b −→ Jθ,n n(θ − θ) N(0, Ik ) where Jθ,n = n−1 Jθ,n ˆ ˆ , ˆ 1 2
Jθ,n ˆ
∂ 2 ℓn (θ; y) =− ∂ θ ⊗ ∂ θ′
, ˆ θ=θ
b is the maximum likelihood estimate of θ. The matrix J ˆ is the observed and θ θ,n information matrix evaluated at the mle. In practice, Theorem 10 is used by approximating the posterior distribution of θ as b J−1 . θ∼ ˙ N θ, ˆ θ,n Note that the normal asymptotic posterior distribution does not depend on the prior, h(θ). This occurs because as n → ∞, the information about θ that is contained in the data dominates the information about θ that is contained in the prior. If the conditions in Appendix C.2 are not satisfied, then the posterior distribution need not be asymptotically normal. Examples that illustrate the necessity of these regularity conditions were described by Le Cam and Yang [1990, §7.5], and by Gelman et al. [1995; §4.3].
An example, in the spirit of Laplace [1774], concerns the posterior distribution of the binomial parameter. Suppose that n independent trials are conducted, where the outcome of each trial is success or failure. Denote the outcome of trial i by Yi , where Yi = 1 if the trial is a success and Yi = 0 if the trial is a failure. If the probability ofP success, say θ, is constant over trials, then the total number n of successes, Tn = i=1 Yi , has a binomial distribution with parameters n and θ. Suppose that the investigator’s prior beliefs about θ can be summarized as θ ∼ Beta(α, β), where α and β are the two parameters of the beta distribution. Common choices for α and β are (1, 1) which corresponds to the uniform distribution and (0.5, 0.5) which corresponds to Jeffreys’s [1961] invariant prior. The,
1052
Robert J. Boik
exact posterior density of θ is
h(θ; y)
=
=
Z θ
θα−1 (1 − θ)β−1 1
θα−1 (1 − θ)β−1
0 tn +α−1
n Y
θyi (1 − θ)1−yi
i=1 n Y
i=1
θyi (1 − θ)1−yi dθ
(1 − θ)n−tn +β−1 I (θ), B(tn + α, n − tn + β) (0,1)
where B(a, b) is the beta function, tn is the total number of observed successes, and I(0,1) (θ) is the indicator function that takes value 1 if θ ∈ (0, 1) and zero otherwise. Accordingly, θ|y ∼ Beta(tn + α, n − tn + β). From Theorem 10, the asymptotic posterior distribution of θ, after suitable scaling, is √ n θ − θb dist q −→ N(0, 1), b − θ) b θ(1
where θb = tn /n. In practice, the Bayesian central limit theorem can be applied by using h i b n−1 θ(1 b − θ) b . θ|y ∼ ˙ N θ,
Figure 7 displays the exact posterior distribution of θ together with the normal approximation for two cases, namely n = 20, tn = 3 and n = 60, tn = 9. In both cases, α = β = 0.5. The normal approximation is reasonably accurate for the n = 60 case, but not for the n = 20 case. Approximate Bayesian 100(1 − α)% credible intervals are given by s b − θ) b θ(1 θb ± zα/2 , n
where zα is the 100(1 − α) percentile of N(0, 1). Note that these credible intervals are identical to the conventional large-sample frequentist intervals for θ. Nonetheless, the interpretation is quite different. In the credible interval, the endpoints are fixed and θ is random. The probability that θ falls in the fixed interval is approximately 1 − α. 8
EXPANSIONS
Convergence of a random variable in distribution to a normal distribution does not guarantee that the normal approximation will be sufficiently accurate when employed with finite sample sizes. Theorem 2 revealed that the cdf approximation error is of magnitude O(n−1/2 ) and, for small sample size this error can be substantial. Various approaches are available for improving the normal approximation.
Normal Approximations
n = 20,
n X
yi = 3
1053
n = 60, 11
Posterior Density Normal Approximation
6
yi = 9
i=1
i=1
7
n X
Posterior Density Normal Approximation
10 9
5
8
Density
Density
7
4 3
6 5 4
2
3 2
1
1
0 −0.1
0
0.1
0.2 θ
0.3
0.4
0.5
0 0
0.05
0.1
0.15
0.2 θ
0.25
0.3
0.35
0.4
Figure 7. Normal Approximation to the Posterior Distribution This section describes the Edgeworth expansion for the sampling distribution of a random quantity. An expansion of the Bayesian posterior distribution also is described. These expansions reduce the magnitude of the cdf approximation error to O(n−1 ) or even to O(n−3/2 ), depending on how many terms in the expansion are retained.
8.1
Edgeworth Expansions
The expansion described in this section was pioneered by Chebyshev [1890], Edgeworth [1905; 1907] and Charlier [1905]. See [Cram´er, 1972] for additional historical notes. dist Suppose that Wn is a continuous scalar random variable that satisfies Wn −→ N(0, 1). In practice, Wn often is of the form √ √ n(Tn − θ) n(Tn − θ) or Wn = , Wn = σT σ bT
bT2 is a consistent estimator of σT2 . Typically, where n Var(Tn ) → σT2 as n → ∞ and σ the random variable Tn is a function of sample means or a function of an estimator obtained as the solution to an estimating equation (e.g., a MLE). Denote the mean and variance of Wn by µn and σn2 , respectively, and denote the rth cumulant of Wn by κr (Wn ). In particular, κ1 (Wn ) = µn = E(Wn ), 3
def
κ2 (Wn ) = σn2 = Var(Wn ), 4
κ3 (Wn ) = E (Wn − µn ) , and κ4 (Wn ) = E (Wn − µn ) − 3σn4 .
1054
Robert J. Boik
Typically, the first four cumulants of Wn satisfy ω2 ω1 κ1 (Wn ) = √ + O n−3/2 , κ2 (Wn ) = 1 + + O n−2 , n n ω4 ω3 −3/2 , and κ4 (Wn ) = + O n−2 , κ3 (Wn ) = √ + O n n n
(11)
where ωj has magnitude O(1) for j = 1, . . . , 4. An Edgeworth expansion of the distribution of Wn modifies the standard normal approximation such that the first r cumulants (typically 3 or 4) of the approximating distribution match those of Wn . The terms in the expansion are functions of Hermite polynomials. Let z be a scalar constant. Then, the rth Hermite polynomial, Hr (z), is defined as def
Hr (z) =
(−1)r dr ϕ(z) , for r = 1, 2, . . . ϕ(z) (d z)r
where ϕ(z) = ϕ(z, 0, 1) is defined in (1). The first seven Hermite polynomials are the following: H0 (z) = 1, H1 (z) = z, H2 (z) = z 2 − 1, H3 (z) = z 3 − 3z, H4 (z) = z 4 − 6z 2 + 3, H5 (z) = z 5 − 10z 3 + 15z, and H6 (z) = z 6 − 15z 4 + 45z 2 − 15.
The Edgeworth expansion of the distribution of Wn up to order O(n−1 ) is given in Theorem 11. A proof is sketched in Appendix B.5 THEOREM 11 Edgeworth. Regularity conditions for Edgeworth expansions are described by Bhattacharya and Rao [1976, §19-20] and Hall [1992b, §2.4]. If the regularity conditions are satisfied, then the pdf and cdf of Wn can be expanded as follows: "
fn (w) = ϕ(w) 1 +
ω1 H1 (w) √ n
and
"
Fn (w) = Φ(w) − ϕ(w) where g2 = ω2 + ω12 ,
ω1 √ n
+
g2 H2 (w) 2n
+
g2 H1 (w) 2n
+
ω3 H3 (w) √ 6 n
+
ω3 H2 (w) √ 6 n
g4 = ω4 + 4ω1 ω3 ,
+
g4 H4 (w) 24n
+
g4 H3 (w) 24n
+
ω32 H6 (w) 72n
+
ω32 H5 (w) 72n
#
#
+ O n−3/2 , + O n−3/2 ,
ϕ is the standard normal pdf, Φ is the standard normal cdf, and Hj (w) is the j th Hermite polynomial. As an example, suppose that Yi for i = 1, . . . , n are iid χ2ν,λ random variables; i.e., each Yi has a noncentral chi-squared √distribution with ν degree of freedom and noncentrality parameter λ. Let Wn = n(Yn − µ)/σ, where µ = E(Yi ) = ν + 2λ
Normal Approximations 0.04
0.35
Exact Normal Edgeworth
0.3
Normal Edgeworth
0.03 0.02
0.25
0.01 CDF Error
0.2 Density
1055
0.15 0.1
0 −0.01 −0.02 −0.03
0.05 −0.04
0 −0.05 −2
−0.05
0
2
4 y¯
6
8
10
−0.06 −2
0
2
4
y¯
6
8
10
iid
Figure 8. Edgeworth Expansion of the Distribution of Y n , n = 5 and Yi ∼ χ21,1 and σ 2 = Var(Yi ) = 2ν + 8λ. The first four cumulants of Wn are readily shown to be κ1 (Wn ) = 0, κ2 (Wn ) = 1, κ3 (Wn ) = √
4ν + 24λ 2n(ν + 4λ)
3 2
, and κ4 (Wn ) =
12ν + 96λ . n(ν + 4λ)2
Accordingly, the cumulant functions required in Theorem 11 are ω1 = 0,
ω2 = 1,
ω3 = √
4ν + 24λ 2(ν + 4λ)
3 2
, and ω4 =
12ν + 96λ . (ν + 4λ)2
Figure 8 displays the normal approximation and the Edgeworth expansion for iid the pdf of Yn when n = 5, and Yi ∼ χ21,1 . Also shown is the cdf error of the normal approximation and of the Edgeworth expansion. The Edgeworth expansion represents a substantial improvement over the normal approximation. The pdf plot also reveals a problem with the Edgeworth expansion; namely that the approximating function can be negative and therefore it is not a pdf. Hall’s [1992a] cubic transformation remedies this problem for Edgeworth expansions up to order O(n−1/2 ). The Edgeworth and related expansions have been applied to many commonly used test statistics and pivotal quantities. The expansions typically are used to improve control of type I error rate or to improve coverage of confidence intervals. For example, the expansions can be used to improve the coverage of nominal 100(1 − α)% confidence intervals from 1 − α + O(n−1/2 ) to 1 − α + O(n−1 ) or even to 1 − α + O(n−3/2 ). Edgeworth and related expansions have been applied to t statistics [Johnson, 1978], regression coefficients [Boik, 2008a], simple, partial and multiple correlation coefficients [Boik and Haaland, 2006; Ogasawara, 2006a; 2006b], eigenvalues of correlation and covariance matrices [Boik, 2005], estimators of principle components weights and variances [Ogasawara, 2006c], and factor loadings in factor analysis [Ogasawara, 2007]. Boik [2008b] gave a method for
1056
Robert J. Boik
obtaining Edgeworth expansions for parameter estimators when the parameters are subject to constraints. —indexEdgeworth expansion
8.2
Expansion of Posterior Distributions
Various higher-order expansions for the posterior distribution have been proposed (e.g., [Woodroofe, 1992; Sun, 1994; Ghosal & Samanta, 1997; Weng & Woodroofe, 2000]). In this section, the expansion proposed by Johnson [1970] is described. For this expansion, it is assumed that the likelihood function depends on a scalar parameter, θ. Johnson’s expansion is based on a second-order expansion of the log likelihood function evaluated at the mle. To employ Johnson’s expansion for the posterior distribution of θ, the following quantities must be evaluated: h
(j)
dj h(θ) b def (θ) = , (dθ)j θ=θˆ
def
anj =
1 nj!
1 ∂ j ℓ(θ; y) def , ˆ, and σθ2ˆ = − j (∂ θ) 2an2 θ=θ
where h(θ) is the prior pdf and ℓ(θ; y) is the log likelihood function. Johnson showed that√if the required regularity conditions are satisfied, then the posterior b ˆ is cdf of Z = n(θ − θ)/σ θ
h(z; y) = P (Z ≤ z|y) = Φ(z) 3 c31 (z 2 + 2) + c32 c41 (z 5 + 5z 3 + 15z) + c42 (z 3 + 3z) + c43 z √ −ϕ(z) + O n− 2 , + n n (1) b σ 6ˆa2n3 σ ˆh (θ) , , c41 = θ where c31 = σθ3ˆan3 , c32 = θ b 2 h(θ) " # b b σ 2ˆh(2) (θ) an3 h(1) (θ) 4 c42 = σθˆ an4 + , and c43 = θ . b b h(θ) 2h(θ)
Because of a typographical error, the term c02 in equation 2.26 in Johnson needs to be divided by 2 to obtain the above expression for c43 . Figure 9 displays the exact posterior as well as the expansions of the posterior pdfs and cdfs to O(n−1/2 ) and to O(n−1 ) for the example in §7. The expansion of the pdf is obtained by differentiating the expansion of the cdf. The expansions yield substantial improvement over the normal approximation, with the expansion to order O(n−1 ) being most accurate. Higher-order expansions of the posterior density can be used to improve the coverage of nominal 100(1 − α)% credible intervals from 1 − α + O(n−1/2 ) to 1 − α + O(n−1 ) or even to 1 − α + O(n−3/2 ).
Normal Approximations
n = 20,
n X
1057
yi = 3
n = 60,
i=1
7 6
n X
yi = 9
i=1
11
Posterior Density −1/2 Expansion to O(n ) −1 Expansion to O(n )
10 9
Posterior Density Expansion to O(n−1/2) Expansion to O(n−1)
8
5
Density
Density
7
4 3
6 5 4
2
3 2
1
1
0 −0.1
0
0.1
0.2 θ
n = 20,
n X
0.3
0.4
0 0
0.5
0.05
yi = 3
0.1
0.15
n = 60,
i=1
0.05
0.08
0.04
0.06
0.03
0.04
0.02
0.02
0.01
0 −0.02 −0.04 −0.06 −0.08 −0.1 −0.1
n X
0.25
0.3
0.35
0.4
0.35
0.4
yi = 9
i=1
CDF Error
CDF Error
0.1
0.2 θ
0 −0.01 −0.02 −0.03
Normal Approximation −1/2 Expansion to O(n ) Expansion to O(n−1) 0
0.1
−0.04 0.2 θ
0.3
0.4
0.5
−0.05 0
Normal Approximation −1/2 Expansion to O(n ) Expansion to O(n−1) 0.05
0.1
0.15
0.2 θ
Figure 9. Expansion of Posterior Distribution.
0.25
0.3
1058
Robert J. Boik
A
A.1
SELECTED MATHEMATICAL AND STATISTICAL CONCEPTS
Conditional Distributions
Suppose that X is a p-vector of random variables, Y is a q-vector of random variables, and (X, Y) has a joint probability function, fX,Y (x, y), with respect to σ-finite measures. Then the probability function of Y, conditional on X = x, is def
fY|X (y|x) =
fX,Y (x, y) , fX (x)
where fX (x) is the marginal probability function for X. In particular, if Y is a continuous random vector, then fY|X (y|x) is a probability density function (pdf) and if Y is a discrete random vector, then fY|X (y|x) is a probability mass function (pmf).
A.2
Convergence Concepts
Definitions for a few of the most important terms and concepts about convergence are given in this section. 1. Convergence in Probability. Let {Xn }∞ n=1 be a sequence of random variables, prob
let X be a random variable, and let c be a constant. The notation Xn −→ X prob or Xn −→ c is read as Xn converges in probability to X (or to c). Furthermore, prob
Xn −→ X if lim P (|Xn − X| < ǫ) = 1 for every ǫ > 0, and n→∞
prob
Xn −→ c if lim P (|Xn − c| < ǫ) = 1 for every ǫ > 0. n→∞
2. Convergence in Distribution. Let {Xn }∞ n=1 be a sequence of random varidist
ables and let X be a random variable. The notation Xn −→ X is read as Xn converges in distribution (or in law) to X. Denote the cumulative distribution functions of Xn and X by FXn (x) and FX (x), respectively. Then, dist
Xn −→ X if
lim FXn (x) = FX (x)
n→∞
for all x at which FX (x) is continuous. It can be shown that prob
dist
Xn −→ X =⇒ Xn −→ X. ∞ 3. Order of Magnitude. Let {an }∞ n=1 and {bn }n=1 be two sequences of real num∞ bers and let {Xn }n=1 be a sequence of random variables.
Normal Approximations
1059
(a) Little o Notation. The notation an = o(bn ) is read as an is little o of bn and it means that the ratio |an /bn | converges to zero as n → ∞. That is, for any ε > 0, there exists an integer, n(ε), such that n > n(ε) =⇒ |an /bn | < ε. (b) Big O Notation. The notation an = O(bn ) is read as an is big O of bn and it means that the ratio |an /bn | is bounded for large n. That is, there exists a finite number, K, and an integer, n(K), such that n > n(K) =⇒ |an /bn | < K. (c) Little op Notation. The notation Xn = op (1) is read as Xn is little o p prob
of 1 and it means that Xn −→ 0 as n → ∞. That is, for every ε > 0, limn→∞ Pr(|Xn | < ε) = 1, or, equivalently, for every ε > 0 and for every η > 0, there exists an integer, n(ε, η), such that if n > n(ε, η) then Pr(|Xn | < ε) ≥ 1 − η. (d) Big Op Notation. The notation Xn = Op (1) is read as Xn is big O p of 1 and it means that for every η > 0, there exists a number K(η) and an integer n(η) such that if n > n(η), then Pr [|Xn | ≤ K(η)] ≥ 1 − η. (e) Operations on Order of Magnitude. The following relationships follow directly from the definitions: an an = o(bn ) ⇐⇒ an = bn o(1) ⇐⇒ = o(1), bb an = O(1), an = O(bn ) ⇐⇒ an = bn O(1) ⇐⇒ bb an = op (1), and an = op (bn ) ⇐⇒ an = bn op (1) ⇐⇒ bb an an = Op (bn ) ⇐⇒ an = bn Op (1) ⇐⇒ = Op (1). bb
A.3
Exchangeable Distributions
The random variables Y1 , Y2 , . . . , Yk are said to have an exchangeable distribution if their joint probability density function or probability mass function satisfies the following condition: fY1 ,Y2 ,...,Yk (y1 , y2 , . . . , yk ) = fY1 ,Y2 ,...,Yk (y1∗ , y2∗ , . . . , yk∗ ), for all y1 , y2 , . . . , yk , where y1∗ , y2∗ , . . . , yk∗ is any permutation of y1 , y2 , . . . , yk . It is readily shown that if the distribution of Y1 , Y2 , . . . , Yk is exchangeable, then marginal distributions of any order are identical. For example, fY1 (u) = fY2 (u) = · · · = fYk (u) for all u, and
fY1 ,Y2 (u, v) = fY1 ,Y3 (u, v) = fYi ,Yj (u, v) for all u, v and for all i 6= j. It follows that E(Y1 ) = E(Yi ) and Var(Y1 ) = Var(Yi ) for i = 2, 3, . . . , k and that Cov(Y1 , Y2 ) = Cov(Yi , Yj ) for all i 6= j.
1060
A.4
Robert J. Boik
Sufficient Statistics
A well-defined collection of distributions is called a family of distributions and the collection is denoted by F. For example, the collection of all p-dimensional normal distributions is a family. Typically, an investigator obtains a sample of data from some distribution FY (y) ∈ F, where F is known (or assumed), but FY (y) is unknown. In most applications, the investigator uses the sample data in an attempt to determine which member of the family generated the data. Denote the sample by Y = (Y1 Y2 . . . YN )′ and let T = T(Y) be a function of the sample. The statistic T is said to be sufficient if it contains all the information that the sample contains about which member of F generated the data. Technically, the statistic T is said to be sufficient if the distribution of the sample, conditional on T, is identical for every FY (y) ∈ F. If this is satisfied, then after conditioning on T, the sample contains no additional information about which member of F generated the data. For example, consider the family that consists of all Bernoulli distributions. A Bernoulli distribution assigns a probability, say τ , to a success and assigns a probability 1 − τ to a failure for a binary random variable. It can be shown that if N independent and identically distributed binary trials are observed, then T = the number of successes is a sufficient statistic. The particular sequence of successes and failures that yields T successes contains no additional information about which member of the Bernoulli family (i.e., which value of τ ) generated the data. A minimal sufficient statistic is a sufficient statistic that is a function of every other sufficient statistic. For example, consider the family of all univariate normal distributions. It is trivially true that T1 = Y is sufficient. It also is true that T2 = (Y SY2 Ye )′ is sufficient, where Y is the sample mean, SY2 is the sample variance, and Ye is the sample median. The statistic T2 is not minimal sufficient, however, even though T2 is a function of T1 . It can be shown that a minimal sufficient statistic for the normal family is T3 = (Y SY2 )′ . The family of interest in §5.2 is quite large and consists of all univariate continuous distributions. For this family, the minimal sufficient statistic, T, is the set of order statistics. Fisher’s permutation test does not use the sufficient statistic to determine which member of F generated the data. Instead, it uses the distribution of the sample, conditional on T, precisely because this distribution is the same for all members of the family.
A.5
Characteristic and Cumulant Generating Functions
Let Y be a random variable. The characteristic function (CF) and cumulant generating function (CGF) of Y are defined as def
CFY (t) = E eitY
def
and CGFY (t) = ln [CFY (t)] , where i2 = −1.
Let r be a nonnegative integer. If E(|Y |r ) < ∞ then it follows from Theorem 3.5 in Severini [2005, p. 78] that the characteristic function can be expanded as
Normal Approximations
(12) CFY (t) = 1 +
r X (it)j j=1
j!
1061
E(Y j ) + o(tr ) as t → 0, where i2 = −1.
It follows that E(Y j ) =
1 ij
dj CFY (t) for j = 0, 1, . . . , r. (d t)j t=0
Also, Theorem 4.13 in Severini [2005, p. 115] verified that if E(|Y |r ) < ∞, where r is a positive integer, then the cumulant generating function of Y can be expanded as
(13) CGFY (t) =
r X (it)j j=1
j!
κj (Y ) + o(|t|r ) as t → 0,
where κj is called the j th cumulant of Y . The cumulants of Y are polynomial functions of the moments of Y . For example, the first six cumulants can be written in terms of the moments as follows: κ1 = E(Y ) = µ,
2
κ2 = E (Y − µ) = σ 2 , 5
3
κ3 = E (Y − µ) , 6
4
κ4 = E (Y − µ) − 3σ 4 ,
κ5 = E (Y − µ) − 10σ 2 κ3 , and κ6 = E (Y − µ) − 15κ4 σ 2 − 10κ23 − 15σ 6 The j th standardized cumulant of Y is denoted by ρj and is defined as def
ρj =
κj for j = 1, . . . , r. σj
The third standardized cumulant, ρ3 , is a measure of skewness and the fourth standardized cumulant, ρ4 , is a measure of kurtosis.
B
B.1
HEURISTIC PROOFS OF SELECTED THEOREMS
Theorem 1: Central Limit Theorem
Suppose that Yj for j = 1, . . . , ∞ are independent and identically distributed def √ random variables with mean µ and variance σ 2 . Define Zn as Zn = n(Y − µ)/σ,
1062
Robert J. Boik
where Y = n−1
Pn
j=1
Yj . The characteristic function of Zn is
√n(Yn −µ) √ it Pn −it nµ √ Y σ CFZn (t) = E eit = e σ E e nσ j=1 j √ h it in −it nµ √ Y =e σ because {Yj }nj=1 are iid E e nσ √ −it nµ t def =⇒ CGFZn (t) = ln [CFZn (t)] = + n ln CFY √ σ nσ √ −it nµ it (it)2 2 = + n ln 1 + √ µ + (σ + µ2 ) + o n−1 from (12) σ 2nσ 2 nσ √ (it)2 2 it 1 (it)2 2 −it nµ 2 −1 +n √ µ+ (σ + µ ) − µ + o n = σ nσ 2nσ 2 2 nσ 2 ε2 + o(ε2 ) using the Taylor expansion ln(1 + ε) = ε − 2 2 t t2 = − + o(1) =⇒ lim CGFZn (t) = − , n→∞ 2 2 which is the CGF of a N(0, 1) random variable. By the uniqueness of the CF and CGF, it follows that the limiting distribution of Zn is N(0, 1).
B.2
Theorem 5: Delta Method
Expand
√
n(Tn − θ) in a Taylor series around Tn = θ: √ √ √ n(Tn − θ) = n(θ − θ) + G n(Tn − θ) + op (1).
√ dist It follows from Slutsky’s Theorem (Theorem 6) that G n(Tn −θ) −→ N(0, GΣG′ ) √ prob dist and that n(Tn − θ) −→ N(0, GΣG′ ) because op (1) −→ 0 as n → ∞.
B.3
Theorem 9: Asymptotic Normality of MLEs
Interchanging integration and differentiation reveals that E [Si (θ; Yi )] = 0 because (14) Z ∂ fi (yi ; θ) 1 ∂ ln[fi (yi ; θ)] fi (yi ; θ) dyi = fi (yi ; θ) dyi E [Si (θ; Yi )] = ∂θ fi (yi ; θ) ∂θ Z Z ∂ ∂1 ∂ fi (yi ; θ) dyi = = 0. fi (yi ; θ) dyi = = ∂θ ∂θ ∂θ Z
Similar operations on the second derivative of ln [fi (θ; Yi )] reveal that 2 ∂ ln [fi (θ; Yi )] ′ (15) E [Si (θ; Yi ) Si (θ; Yi ) ] = −E . ∂ θ ⊗ ∂ θ′
Normal Approximations
1063
Specifically, Z 2 ∂ ln [fi (θ; yi )] ∂ 2 ln [fi (θ; Yi )] fi (yi ; θ) dyi = − ∂ θ ⊗ ∂ θ′ ∂ θ ⊗ ∂ θ′ Z 1 ∂ fi (yi ; θ) ∂ =− fi (yi ; θ) dyi ∂θ fi (yi ; θ) ∂ θ′ 2 Z 1 1 ∂ fi (yi ; θ) ∂ fi (yi ; θ) ∂ fi (yi ; θ) − fi (yi ; θ) dyi = ∂θ fi (yi ; θ)2 fi (yi ; θ) ∂ θ ⊗ ∂ θ ′ ∂ θ′ Z Z 2 ∂ ln [fi (yi ; θ)] ∂ ln [fi (yi ; θ)] ∂ fi (yi ; θ) = dyi fi (yi ; θ) dyi − ∂θ ∂ θ′ ∂ θ ⊗ ∂ θ′ Z ∂2 fi (yi ; θ) dyi = E [Si (θ; Yi ) Si (θ; Yi )′ ] − ∂ θ ⊗ ∂ θ′ ∂2 1 = E [Si (θ; Yi ) Si (θ; Yi )′ ] . = E [Si (θ; Yi ) Si (θ; Yi )′ ] − ∂ θ ⊗ ∂ θ′ −E
Define S(θ; Y) as n−1 S(θ; Y). Then E
√ √ 1 n S(θ; Y) = 0 and Var n S(θ; Y) = E [S(θ; Y) S(θ; Y)′ ] = Iθ,n , n
where Iθ,n = n−1 Iθ,∞ . Furthermore, it follows from Theorem 3 that (16)
√ dist n S(θ; Y) −→ N(0, Iθ,∞ ), where Iθ,∞ = lim Iθ,n . n→∞
b is the solution to S(θ; Y) = 0 for θ. Expand S(θ; b Y) Recall that the MLE, θ, b around θ = θ to obtain √
b Y) = 0 = n S(θ;
√
√ b − θ) + op (1), n S(θ; Y) − Jθ,n n(θ
where Jθ,n = −
1 ∂ 2 ℓn (θ; Y) . n ∂ θ ⊗ ∂ θ′
By the law of large numbers, Jθ,n = Iθ,n +op (1). Accordingly,
√ √ b − θ) = n S(θ; Y). Iθ,n +op (1) n(θ
It follows from (16) and from Theorem 6 that Iθ,n
√
b − θ) − n(θ
√ √ prob dist b − θ) −→ n S(θ; Y) −→ 0, Iθ,n n(θ N(0, Iθ,∞ ), √ b dist and n(θ − θ) −→ N(0, I−1 θ,∞ ).
1064
B.4
Robert J. Boik
Theorem 10: Bayesian Central Limit Theorem
The proof of Theorem 10 proceeds by expanding the log likelihood function around b and then performing the integration in the denominator. This approach to θ=θ approximating exponential integrals is called the Laplace method (see [Severini, 2000, §2.11]). Under the assumed regularity conditions, the log likelihood function can be expanded as
∂ ℓ (θ; y) n b y) + √ ℓn (θ; y) = ℓn (θ; ′ n∂ θ "
ˆ θ=θ
#
√ b n(θ − θ)
√ 1√ b ′ Jˆ b + Op n|θ − θ| b3 n(θ − θ) n(θ − θ) + θ,n 2 √ √ 1 b y) − b ′ Jˆ b + Op n|θ − θ| b3 , = ℓn (θ; n(θ − θ) n(θ − θ) θ,n 2 2 ∂ ℓn (θ; y) ∂ ℓn (θ; y) def where Jθ,n , because = 0. = − ˆ n ∂ θ ⊗ ∂ θ′ ˆ ∂ θ′ ˆ θ=θ
θ=θ
Substituting this expansion into the posterior distribution yields
h(θ; y) = Z
h(θ) exp {ℓn (θ; y)}
h(θ) exp {ℓn (θ; y)} dθ o n n b + Op n|θ − θ| b3 b ′ J ˆ (θ − θ) h(θ) exp − (θ − θ) θ,n 2 =Z o n n b + Op n|θ − θ| b3 b ′ J ˆ (θ − θ) dθ h(θ) exp − (θ − θ) θ,n 2 Θ Θ
Transforming the densities in the numerator and denominator from θ to un = √ b expanding h(θ) around θ = θ, b and using multivariate Mill’s ratio n(θ − θ),
Normal Approximations
1065
yields h i 1 ′ −1/2 −1/2 b h(θ) + Op n exp − un Jθ,n ˆ un + Op n 2 h(θ; Y) = Z h i 1 ′ −1/2 −1/2 b dun exp − u Jθ,n h(θ) + Op n ˆ un + Op n 2 IRk h i b + Op n−1/2 exp − 1 u′ J ˆ un + Op n−1/2 h(θ) n θ,n 2 = 1/2 k/2 J −1 b h(θ)(2π) + Op n−1/2 ˆ θ,n 1 exp − u′n Jθ,n ˆ un h i 2 = 1 + Op n−1/2 1/2 −1 (2π)k/2 Jθ,n ˆ Z 1/2 1 k/2 −1 u . d u = (2π) because exp − u′n Jθ,n J ˆ ˆ n n θ,n 2 IRk
1 √ dist 2 b −→ n(θ − θ) N(0, Ik ). Accordingly, Jθ,n ˆ
B.5
Theorem 11: Edgeworth Expansion
The Edgeworth expansion is based on properties of Hermite polynomials, defined in §8, and properties of characteristic and cumulant generating functions, defined in §B.1. Severini [2005, Theorem 3.8] verified that if Y is a random variable whose characteristic function, CFY (t) satisfies (17)
Z
∞
−∞
|CF(t)| dt < ∞, then the pdf of Y is fY (y) =
1 2π
Z
∞
e−ity CFY (t) dt.
−∞
Suppose that Z ∼ N(0, 1). Then the characteristic function of Z is CFZ (t) = exp{−t2 /2}. It follows from (17) that ϕ(z) =
1 2π
Z
∞
2
e−itz e−t
/2
dt,
−∞
where ϕ(z) = ϕ(z, 0, 1). Differentiating both sides of the above equality and using the definition of Hermite polynomials (see §8) reveals that dr ϕ(z) 1 (18) = (−1)r ϕ(z)Hr (z) = r (dz) 2π
Z
∞
−∞
1 2
e−itz (−it)r e− 2 t dt.
1066
Robert J. Boik
Also, if r ≥ 1, then Z
z
Hr (u)ϕ(u) du = −Hr−1 (z)ϕ(z) because (19) Z z Z z Z z r (−1)r dr ϕ(u) d ϕ(u) r Hr (u)ϕ(u) du = ϕ(u) du = (−1) du r r ϕ(u) (du) −∞ −∞ −∞ (du) z z dr−1 ϕ(u) = (−1)r (−1)r−1 Hr−1 (u)ϕ(u) = −Hr−1 (z)ϕ(z). = (−1)r r−1 (du) −∞
−∞
−∞
To justify Theorem 11, first use (11) and (13) to expand the cumulant generating function of Wn . The result is (it)2 (it)4 it ω2 (it)3 CGFWn (t) = √ ω1 + 1+ + √ ω3 + ω4 + o n−1 . n 2 n 6 n 24n
Second, use the inversion formula (17) and the above expansion to obtain Z ∞ Z ∞ 1 1 eitw CFWn (t) dt = eitw exp{CGFWn (t)} dt fn (w) = 2π −∞ 2π −∞ Z ∞ 1 (it)2 (it)4 ω2 (it)3 it = 1+ + √ ω3 + ω4 + o n−1 dt. eitw exp √ ω1 + 2π −∞ n 2 n 6 n 24n
Using (it)2 = −t2 and expanding the exponential function yields Z ∞ 1 2 (it)3 it 1 (it)2 (it)4 ω2 + √ ω3 + ω4 + o n−1 dt eitw e− 2 t exp √ ω1 + fn (w) = 2π −∞ 2n 24n n 6 n Z ∞ h it 1 1 2 (it)2 (it)4 (it)3 = ω2 + √ ω3 + ω4 eitw e− 2 t 1 + √ ω1 + 2π −∞ n 2n 6 n 24n i (it)6 2 (it)2 2 (it)4 ω1 + ω1 ω3 + ω3 + o n−1 dt + 2n 6n 72n Z ∞ h 2 it 1 2 (it) (it)4 1 (it)3 = (ω2 + ω12 ) + √ ω3 + (ω4 + 4ω1 ω3 ) eitw e− 2 t 1 + √ ω1 + 2π −∞ n 2n 6 n 24n i (it)6 2 ω3 + o n−1 dt. + 72n Lastly, use (18) to integrate term by term. The result is as follows:
h H3 (z) H1 (z) H2 (z) H4 (z) (ω2 + ω12 ) + √ ω3 + (ω4 + 4ω1 ω3 ) fn (w) = ϕ(w) 1 + √ ω1 + 2n 24n n 6 n i H6 (z) 2 + ω3 + o n−1 . 72n
The Edgeworth expansion for the cdf, Fn (w) is obtained by using (19) to integrate the pdf expansion term by term.
Normal Approximations
C
C.1
1067
REGULARITY CONDITIONS
Asymptotic Normality of MLEs
1. The support of Yi , namely Yi , does not depend on θ. 2. The parameter θ is identifiable. That is, f (Yi ; θ 1 ) = f (Yi ; θ 2 ) ∀ Yi ∈ Yi =⇒ θ 1 = θ 2 . 3. The parameter space, namely Θ, is an open subset in IRk where k = dim(θ). 4. The log likelihood, ℓ(θ; Y), is a third order differentiable function of θ in an open neighborhood in Θ that contains the true value. Furthermore, the order of integration and differentiation can be interchanged. 5. The average information, namely Iθ,n , is finite and positive definite, def
= 0, where Iθ,n = n−1 Iθ,n and lim Iθ,n = Iθ,∞ and lim I−1 n→∞ θ,n " ′ # ∂ ℓ(θ; Y) ∂ ℓ(θ; Y) def . Iθ,n = E ∂θ ∂θ
n→∞
6. Denote the true value of θ by θ 0 . If θ is in a neighborhood of θ 0 , then
∂ 3 ℓi (θ)
∂ θ ⊗ ∂ θ ′ ⊗ θ ′ < Mi (Yi ), where Mi (Yi ) satisfies Eθ0 [Mi (Yi )] < ∞.
7. The score function for the ith observation, namely def
Si (θ; Yi ) =
∂ ln [fi (Yi ; θ)] , ∂θ
satisfies the Lindeberg-Feller condition (see Theorem 4). Specifically, for every ε > 0, n p √ i 1X h E Qni I( Qni ≥ nε) → 0 as n → ∞, where n i=1 −1
Qni = Si (θ; Yi )′ Iθ,n Si (θ; Yi ).
C.2
Asymptotically Normal Posterior Distributions
The following conditions are adapted from [Schervish, 1995, §7.4]. 1. The parameter space is Θ ∈ IRk , where k is finite.
1068
Robert J. Boik
2. The point θ 0 is an interior point of Θ. 3. The prior distribution has a density with respect to Lebesgue measure that is positive and continuous at θ 0 . The prior density is denoted by h(θ). 4. The log likelihood function, ℓn (θ; y), is twice continuously differentiable in a neighborhood of θ 0 . 5. The observed information matrix is defined as def
Jθ,n = −
∂ 2 ℓn (θ; y) . ∂ θ ⊗ ∂ θ′
The observed information matrix evaluated at the MLE is denoted by Jθ,n ˆ . prob
Denote the smallest eigenvalue of Jθ,n by λn,1 . Then λ−1 ˆ n,1 −→ 0 as n → ∞. 6. Let λn,k be the largest eigenvalue of Jθ,n and denote an open ball of radius ˆ δ around θ 0 by N0 (δ). Denote the part of Θ that does not include N0 (δ) by Θc0 . If N0 (δ) ⊆ Θ, then there exists K(δ) > 0 such that ( ) lim Pθ0
n→∞
sup λ−1 n,k [ℓn (θ; y) − ℓn (θ 0 ; y)] < −K(δ)
θ∈Θc0
= 1.
∗ 7. Denote the smallest eigenvalue of J−1 ˆ Jθ,n by λn,1 . For each ǫ > 0, there θ,n exists δ(ǫ) > 0 such that !
lim Pθˆ 0
n→∞
sup
θ∈N0 (δ(ǫ))
|1 − λ∗n,1 | < ǫ
= 1.
ACKNOWLEDGEMENTS I thank an anonymous referee for comments on earlier versions of this chapter. The comments were very helpful in improving the chapter. BIBLIOGRAPHY [Adams, 1974] W. J. Adams. The Life and Times of the Central Limit Theorem, New York: Kaedmon Publishing Company, 1974. [Bhattacharya and Rao, 1976] R. N. Bhattacharya and R. R. Rao. Normal Approximation and Asymptotic Expansions, New York: John Wiley & Sons, 1976. [Berry, 1941] A. C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49, 122–136, 1941. [Berry and Lindgren, 1996] D. A. Berry and B. W. Lindgren. Statistics Theory and Methods, second edition, Belmont CA: Duxbury Press, 1996. [Bishop et al., 1975] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice, Cambridge MA: MIT Press, 1975.
Normal Approximations
1069
[Boik, 1987] R. J. Boik. The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26–42, 1987. [Boik, 2005] R. J. Boik. Second-order accurate inference on eigenvalues of covariance and correlation matrices. Journal of Multivariate Analysis, 96, 136–171, 2005. [Boik, 2008a] R. J. Boik. Accurate confidence intervals in regression analysis of non-normal data. Annals of the Institute of Statistical Mathematics, 60, 61–83, 2008. [Boik, 2008b] R. J. Boik. An implicit function approach to constrained optimization with applications to asymptotics. Journal of Multivariate Analysis, 99, 465–489, 2008. [Boik and Haaland, 2006] R. J. Boik and B. Haaland. Second-order accurate inference on simple, partial, and multiple correlations. Journal of Modern Applied Statistical Methods, 5, 283–308, 2006. [Box and Anderson, 1955] G. E. P. Box and S. L. Anderson. Permutation theory in the derivation of robust criteria and the study of departures from assumptions. Journal of the Royal Statistical Society, Series B, 17, 1–26, 1955. [Bradley and Gart, 1962] R. A. Bradley and J. J. Gart. The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 49, 205–214, 1962. ¨ [Charlier, 1905] C. V. L. Charlier. Uber das Fehlergesetz. Arkiv f¨ or Matematik, Astronomi och Fysik, 2, No. 8, 1905. [Chebyshev, 1890] P. L. Chebyshev. Sur deux th´ eor` emes relatifs aux probabilit´ es. Acta Mathematica, 14, 305–315, 1890. [Chen, 1985] C. F. Chen. On asymptotic normality of limiting density functions with Bayesian implication. Journal of the Royal Statistical Society, Series B, 47, 540–546, 1985. [Chirstensen, 1996] R. Christensen. Plane Answers to Complex Questions, New York: SpringerVerlag, 1996. [Cram´ er, 1946] H. Cram´ er. Mathematical Methods of Statistics. Princeton New Jersey: Princeton University Press, 1946. [Cram´ er, 1972] H. Cram´ er. Studies in the history of probability and statistics. XXVIII. On the history of certain expansions used in mathematical statistics. Biometrika, 59, 205–207, 1972. [Daniels, 1961] H. E. Daniels. The asymptotic efficiency of a maximum likelihood estimator. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press, 1961. [David, 1938] F. N. David. Limiting distributions connected with certain methods of sampling human populations. Statistical Research Memoirs, 2, 69–90, 1938. [David, 1995] H. A. David. First (?) occurrence of common terms in mathematical statistics. American Statistician, 49, 121–133, 1955. [Dawid, 1970] A. P. Dawid. On the limiting normality of posterior distributions. Proceedings of the Cambridge Philosophical Society, 67, 625–633, 1970. [de Moivre, 1733] A. de Moivre. Approximatio ad Summam terminorum Binomii a + bn in Seriem expansi, privately printed and dated November 12, 1733. [Doob, 1934a] J. L. Doob. Probability and statistics. Transactions of the American Mathematical Society, 36, 759–775, 1934. [Doob, 1934b] J. L. Doob. Statistical estimation. Transactions of the American Mathematical Society, 39, 410–421, 1934. [Edgeworth, 1905] F. Y. Edgeworth. The law of error. Transactions of the Cambridge Philosophical Society, 20, 36–65, 113–141, 1905. [Edgeworth, 1907] F. Y. Edgeworth. On the representation of statistical frequency by a series. Journal of the Royal Statistical Society, 70, 102–106, 1907. [Erd”os and R´ enyi, 1959] P. 
Erd¨ os and A. R´ enyi. On the central limit theorem for samples from a finite population. Publications of the Mathematics Institute of the Hungarian Academy of Science, 4, 49–57, 1959. [Esseen, 1945] G. Esseen. Fourier analysis of distribution functions. A mathematical study of the Gauss-Laplacian law. Acta Mathematica, 77, 1–125, 1945. ¨ [Feller, 1935] W. Feller. Uber den Zentralen Grenswertsatz der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift, 40, 512–559, 1935. [Feller, 1971] W. Feller. Introduction to Probability Theory and its Applications, Vol II, 2nd ed. New York: John Wiley & Sons, 1971. [Fisher, 1912] R. A. Fisher. On an absolute criterion for fitting frequency curves. Messenger of Mathematics, 41, 155–160, 1912.
1070
Robert J. Boik
[Fisher, 1922] R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Ser A, 222, 309–368, 1922. [Fisher, 1925] R. A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700–725, 1925. [Fisher, 1936] R. A. Fisher. Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd, 1936. [Fisher, 1937] R. A. Fisher. The Design of Experiments, Edinburgh: Oliver & Boyd 1937. [Galton, 1877] F. Galton. Typical laws of heredity. Nature, 15, 492–495, 512–514, 532–533, 1877. [Galton, 1895] F. Galton. Natural Inheritance, London: MacMillan, 1895. [Gauss, 1809] K. F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Arbientium, Hamburg: Perthes and Besser 1809. [Gelman et al., 1995] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, London: Chapman & Hall, 1995. [Ghosal and Samanta, 1997] S. Ghosal and T. Samanta. Asymptotic expansions of posterior distributions in nonregular cases. Annals of the Institute of Statistical Mathematics, 49, 181– 197, 1997. [Hajek, 1960] J. Hajek. Limiting distribution in simple random sampling from a finite population. Publications of the Mathematics Institute of the Hungarian Academy of Science, 5, 361–374, 1960. [Hall, 1992a] P. Hall. On the removal of skewness by transformation. Journal of the Royal Statistical Society, Series B, 54, 221–228, 1992. [Hall, 1992b] P. Hall. The Bootstrap and Edgeworth Expansion, New York: Springer-Verlag, 1992. [Heyde and Johnstone, 1979] C. C. Heyde and I. M. Johnstone. On asymptotic posterior normality for stochastic processes. Journal of the Royal Statistical Society, Series B, 41, 184–189, 1979. [Hoadley, 1971] B. Hoadley. Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case. The Annals of Mathematical Statistics, 42, 1977– 1991. [Jeffreys, 1961] H. Jeffreys. Theory of Probability, third edition. New York: Oxford University Press, 1961. [Johnson, 1978] N. J. Johnson. Modified t tests and confidence intervals for asymmetrical populations. Journal of the American Statistical Association, 73, 536–544, 1978. [Johnson, 1970] R. A. Johnson. Asymptotic expansions associated with posterior distributions. Annals of Mathematical Statistics, 41, 851–864, 1970. [Kruskal and Stigler, 1977] W. H. Kruskal and S. M. Stigler. Normative terminology ‘normal’ in statistics and elsewhere. In B.D. Spencer (Ed.), Statistics and Public Policy, pp. 85–111, Oxford: Oxford University Press, 1977. [Laplace, 1774] P. S. Laplace. M´ emorie sur la Probabilit´ e de causes par les ´ ev` enements. M´ emories de Math´ ematique et de Physique, Present´ es ` a l’Academie Royal des Sciences par divers Savans & lˆ us dans ses Assembl´ ees, 6, 621–656. (English translation by S.M. Stigler printed in Statistical Science, 1986, 1, 364–378, 1774. [Laplace, 1810] P. S. Laplace. Memorie ´ sur les approximations des formules qui sont fonctions de tre`s grands nombres et sur leur application aux probabilit´ es. M´ emories de l’Acad´ emie des Sciences de Paris, 1809: 353–415, 1810. [Le Cam, 1970] L. Le Cam. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics, 41, 802–828, 1970. [Le Cam, 1986] L. Le Cam. The central limit theorem around 1935, Statistical Science, 1, 78–96, 1986. [Le Cam and Yang, 1990] L. Le Cam and G. L. Yang. 
STEIN'S PHENOMENON

Richard Charnigo and Cidambi Srinivasan
1
INTRODUCTION
One of the first inferential principles students learn in a beginning statistics course is that they can estimate the mean of a population using the mean of a random sample drawn from that population. In the special case that the sample size is one, the population mean may be estimated by the value of the single observation drawn from the population.

Suppose, however, that we wish to estimate not just the mean of one population but rather the means of several distinct populations. Let p denote the number of population means that we wish to estimate. Let these means be denoted by µ1, µ2, ..., µp and the corresponding population standard deviations by σ1, σ2, ..., σp. Let X1 denote the mean of the sample of size n1 drawn from the first population (or, if the sample is of size one, the value of the single observation), X2 the mean of the sample of size n2 drawn from the second population, and so forth. Our intuition tells us that we should estimate µ1 by X1, µ2 by X2, and so forth. In particular, our intuition suggests that X2 through Xp should be of no use in estimating µ1 because the populations are distinct. And, had we been asked to estimate only µ1, our intuition to use only X1 — in other words, to disregard X2 through Xp — would have been correct.

Indeed, ample statistical justifications have been offered to support the use of a sample mean to estimate a single unknown population mean (or to predict the next observation from a population). Gauss [1821; 1823; 1826] showed that if the observations drawn from a population follow a normal (or bell-shaped) distribution, the sample mean of the observations — which is then also normally distributed — is the best estimator of the population mean in the sense of minimizing the sum of squared deviations between the individual observations and the estimator. Satisfaction of this least squares criterion implies that the sample mean is, in the case of normally distributed observations, what statisticians refer to as the maximum likelihood estimator of the population mean. Subsequently, several eminent statisticians (for example, [Fisher, 1925; Neyman, 1937; Rao, 1945; 1967; Wald, 1949; Hodges and Lehmann, 1950; Blyth, 1951; LeCam, 1953]) rigorously demonstrated by various criteria the optimality
of the sample mean or, more generally, of the maximum likelihood estimator and advanced statistical theory or methodology accordingly. Thus, for much of the 20th century, there was a general belief that sample means should also be optimal when one wanted to estimate several population means simultaneously. The insight of Charles Stein changed all that.

Before proceeding to Stein's insight, we provide a few details about maximum likelihood to make this article self-contained. The likelihood function evaluated at a1, a2, ..., ap is roughly proportional to the probability that X1, X2, ..., Xp (here assumed to be continuous random variables) fall within a small neighborhood of a1, a2, ..., ap,

\[ L(a_1, a_2, \ldots, a_p) := \lim_{\epsilon \to 0} P(a_j - \epsilon/2 \le X_j \le a_j + \epsilon/2 \text{ for } 1 \le j \le p)/\epsilon^p, \]

where L represents likelihood and P represents probability. If X1, X2, ..., Xp are normally distributed with means µ1, µ2, ..., µp and a common standard deviation of 1, then the likelihood function has the form

\[ (1)\quad L(a_1, a_2, \ldots, a_p) = \prod_{j=1}^{p} (2\pi)^{-1/2} \exp\{-0.5(\mu_j - a_j)^2\}. \]

Clearly, maximizing (1) is the same as minimizing $\sum_{j=1}^{p} (\mu_j - a_j)^2$, which is accomplished when µj = aj for 1 ≤ j ≤ p. This suggests that, if X1, X2, ..., Xp assume the numerical values a1, a2, ..., ap and we don't know µ1, µ2, ..., µp, then the most plausible guesses for µ1, µ2, ..., µp based on the data are a1, a2, ..., ap. We refer to the numbers a1, a2, ..., ap as maximum likelihood estimates and to the underlying random variables X1, X2, ..., Xp as maximum likelihood estimators. For brevity, we also refer to X1, X2, ..., Xp as maximum likelihood estimates when we have in mind that they are assuming numerical values, even if we do not specify those numerical values explicitly. Besides having an evidentiary interpretation [Royall, 1997], maximum likelihood estimation enjoys several desirable properties in statistical inference, including asymptotic consistency, normality, and efficiency. Readers interested in learning more about maximum likelihood are directed to other articles in this book and to texts such as [Cox and Hinkley, 1974].
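To make the definition concrete, here is a minimal numerical sketch (ours, not part of the original chapter; it assumes NumPy and SciPy are available, and the observed values are hypothetical) showing that the log of the likelihood (1) is largest when the guesses equal the observations:

```python
import numpy as np
from scipy.stats import norm

# Observed values a_1, ..., a_p of X_1, ..., X_p (hypothetical numbers).
a = np.array([1.2, -0.5, 2.0])

def log_likelihood(mu):
    # Log of the likelihood (1): sum of log N(a_j; mu_j, 1) densities.
    return norm.logpdf(a, loc=mu, scale=1.0).sum()

# The likelihood is maximized when the guesses equal the observations.
print(log_likelihood(a))          # largest attainable value
print(log_likelihood(a + 0.3))    # any other guess gives a smaller value
```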
2
STEIN’S INSIGHT
In his 1956 paper, Stein established an amazing result contradicting the general belief that sample means should be optimal when estimating several population means simultaneously, even with normally distributed observations. He showed that, when p ≥ 3, better estimators of µ1 , µ2 , . . . , µp were available than X1 , X2 , . . . , Xp in the sense of minimizing the expected squared Euclidean distance between the estimators and their targets.
Let us assume that $\sigma_1/\sqrt{n_1} = \sigma_2/\sqrt{n_2} = \cdots = \sigma_p/\sqrt{n_p} = 1$ and that X1, X2, ..., Xp are normally distributed. The first assumption, which is mainly for convenience, says that the random variables X1, X2, ..., Xp have a common standard deviation of 1. Note that the standard deviations of the random variables X1, X2, ..., Xp are smaller than the standard deviations of the corresponding populations (except when the sample size is one). Indeed, this reduction of variation provided a reason to advocate the use of sample means. Yet, Stein [1956] showed that the estimators

\[ (2)\quad \hat{\mu}_1 := \left(1 - \frac{b}{a + \sum_{j=1}^{p} X_j^2}\right) X_1, \quad \hat{\mu}_2 := \left(1 - \frac{b}{a + \sum_{j=1}^{p} X_j^2}\right) X_2, \quad \ldots, \quad \hat{\mu}_p := \left(1 - \frac{b}{a + \sum_{j=1}^{p} X_j^2}\right) X_p \]

satisfied

\[ (3)\quad E\left[\sum_{j=1}^{p} (\hat{\mu}_j - \mu_j)^2\right] < E\left[\sum_{j=1}^{p} (X_j - \mu_j)^2\right] \]

for suitably chosen nonnegative constants a and b depending on p ≥ 3. Above, the E symbol represents expected value, which can be conceptualized as a long-term average if the processes of sampling and constructing estimators could be repeated an unlimited number of times. The $\hat{\ }$ (hat) symbol denotes an estimator. Inequality (3) is true regardless of the numerical values for µ1, µ2, ..., µp, which implies a sort of uniform superiority of the estimators in (2). This inequality is referred to as Stein's phenomenon or, sometimes, Stein's paradox. The paradox is that X2 through Xp have no apparent connection to µ1, but nonetheless X2 through Xp can enhance the estimation of µ1 — as long as the estimation of µ1 is part of the broader goal of estimating µ1, µ2, ..., µp simultaneously. This capacity to borrow strength from unrelated data is all the more remarkable because the unrelated data need not have a common subject matter interpretation. An entertaining quotation from Robbins [1951] drives home the message: "X1 could be an observation on a butterfly in Ecuador, X2 on an oyster in Maryland, X3 the temperature of a star, and so on."

Indeed, Robbins anticipated the possibility of improving upon maximum likelihood estimators. He did so in a study on compound decision problems that focused on inferences about binary-valued µ1, µ2, ..., µp. Specifically, Robbins assumed that each µj, 1 ≤ j ≤ p, was known to have one of the two values +1 or −1 and that Xj, 1 ≤ j ≤ p, was normally distributed with mean µj and standard deviation 1. He considered the statistical problem of
deciding whether the value of µj was +1 or −1 based on the data X1, X2, ..., Xp, with the proportion of incorrect decisions as the loss (a measure of "badness" of the decisions). Denoting the signum of x by sgn(x), the maximum likelihood estimators of µ1, µ2, ..., µp are sgn(X1), sgn(X2), ..., sgn(Xp). They provide natural decision rules asserting that the value of µj is sgn(Xj) for 1 ≤ j ≤ p. The expected loss, or risk, for each of these decision rules is 0.1587. Note that the maximum likelihood decision rule for µj depends solely on Xj and not on X1, X2, ..., X_{j−1}, X_{j+1}, ..., Xp.

Using an elegant argument, Robbins developed competing decision rules asserting that the value of µj is $\mathrm{sgn}(X_j - X^*)$ if $|\sum_{j=1}^{p} X_j/p| < 1$, that it is +1 if $\sum_{j=1}^{p} X_j/p \ge 1$, and that it is −1 if $\sum_{j=1}^{p} X_j/p \le -1$, where $X^* := 0.5 \log\left[\left(1 + \sum_{j=1}^{p} X_j/p\right)\Big/\left(1 - \sum_{j=1}^{p} X_j/p\right)\right]$. The decision rule for µj now depends not only on Xj but also on X1, X2, ..., X_{j−1}, X_{j+1}, ..., Xp. These decision rules outperform the maximum likelihood decision rules as p → ∞, in that the expected loss tends to a number strictly less than 0.1587, unless the proportion of µj equaling +1 tends to 1/2 as p → ∞, in which case the expected loss tends to 0.1587 exactly.

On the other hand, Stein was the first author to establish that the maximum likelihood estimators X1, X2, ..., Xp could be surpassed in problems involving real-valued µ1, µ2, ..., µp and a finite p. Stein's subsequent collaboration with James [1961] yielded the following concrete prescriptions for estimators of real-valued µ1, µ2, ..., µp:

\[ (4)\quad \hat{\mu}_1 := \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\right) X_1, \quad \hat{\mu}_2 := \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\right) X_2, \quad \ldots, \quad \hat{\mu}_p := \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\right) X_p \]

and, as discussed by Efron and Morris [1973],

\[ (5)\quad \hat{\mu}_1 := \left(\frac{p-3}{\sum_{j=1}^{p} (X_j - \bar{X})^2}\right) \bar{X} + \left(1 - \frac{p-3}{\sum_{j=1}^{p} (X_j - \bar{X})^2}\right) X_1, \quad \ldots, \quad \hat{\mu}_p := \left(\frac{p-3}{\sum_{j=1}^{p} (X_j - \bar{X})^2}\right) \bar{X} + \left(1 - \frac{p-3}{\sum_{j=1}^{p} (X_j - \bar{X})^2}\right) X_p, \]
where the so-called grand mean $\bar{X}$ is the arithmetic average of all of the sample means, $\sum_{j=1}^{p} X_j/p$. The grand mean can be viewed as an estimator of the arithmetic average of all of the population means, $\sum_{j=1}^{p} \mu_j/p$.

We note that prescriptions (4) and (5) can be generalized to accommodate the situation in which $\sigma_1/\sqrt{n_1}, \sigma_2/\sqrt{n_2}, \ldots, \sigma_p/\sqrt{n_p}$ have a common value τ ≠ 1. The generalization is that each instance of (p − 2) or (p − 3) gets multiplied by τ², which represents a common variance for the random variables X1, X2, ..., Xp. Comparing prescriptions (4) and (5), we see that the former entails shrinking the maximum likelihood estimators X1, X2, ..., Xp toward 0, while the latter entails drawing the maximum likelihood estimators X1, X2, ..., Xp toward the grand mean $\bar{X}$. In effect, prescription (5) represents a compromise between a maximum likelihood estimator and the grand mean.
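For concreteness, here is a minimal Python sketch of prescriptions (4) and (5) (our illustration, not part of the original text; it assumes NumPy, and the input values are hypothetical):

```python
import numpy as np

def james_stein_zero(x):
    # Prescription (4): shrink the sample means toward 0.
    p = x.size
    return (1.0 - (p - 2) / np.sum(x ** 2)) * x

def james_stein_grand_mean(x):
    # Prescription (5): draw the sample means toward the grand mean.
    p = x.size
    xbar = x.mean()
    c = (p - 3) / np.sum((x - xbar) ** 2)   # weight on the grand mean
    return c * xbar + (1.0 - c) * x

x = np.array([7.3, 6.6, 7.1, 6.6, 7.7])     # hypothetical sample means
print(james_stein_zero(x))
print(james_stein_grand_mean(x))
```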
3
A DATA ANALYSIS EXAMPLE
To illustrate Stein's phenomenon, we consider data on bladder cancer mortality among males in 13 southeastern states of the U.S. (AR, TN, NC, LA, MS, AL, GA, SC, MO, KY, WV, VA, FL). The data were acquired from the website http://statecancerprofiles.cancer.gov/historicaltrend/joinpoint.noimage.html, which is maintained by the National Cancer Institute and the Centers for Disease Control and Prevention.

We have p = 13, µ1 = expected annual deaths from bladder cancer per 100,000 males in Arkansas (computed as the average observed number of deaths from bladder cancer per 100,000 males in Arkansas over the years 1991 to 2000), X1 = observed number of deaths from bladder cancer per 100,000 males in Arkansas during the single year 1990, µ2 = expected annual deaths from bladder cancer per 100,000 males in Tennessee (computed as the average observed number of deaths from bladder cancer per 100,000 males in Tennessee over the years 1991 to 2000), X2 = observed number of deaths from bladder cancer per 100,000 males in Tennessee during the single year 1990, and so forth. These quantities are listed in Table 1.

We view µ1, µ2, ..., µ13 as targets to be estimated. The intuitive, maximum likelihood estimate of µ1 is X1, while the James-Stein estimate based on prescription (5) is 0.199X1 + 0.801X̄. The maximum likelihood and James-Stein estimates of µ2 through µ13 are defined analogously. The technical details of how we acquired the 0.801 for the James-Stein estimates are presented at the end of this section.

So, which are better, the maximum likelihood estimates or the James-Stein estimates? Figure 1 compares both the maximum likelihood estimates and the James-Stein estimates to the targets. Compared to the maximum likelihood estimates, the James-Stein estimates are less than half the distance to the targets for Tennessee, South Carolina, Missouri, and Virginia. More specifically, the James-Stein estimates reduce the absolute estimation error by 87.3% for Tennessee, 57.3% for South Carolina, 52.9% for Missouri, and 87.6% for Virginia.
[Figure 1. Data analysis exemplifying Stein's phenomenon. For each of the 13 states, the plot shows the maximum likelihood estimate, the James-Stein estimate, and the expected annual deaths (the target to be estimated), with the mortality rate on the vertical axis.]
Table 1. Bladder cancer data

State            j    Xj    µj
Arkansas         1    7.3   7.02
Tennessee        2    6.6   7.32
North Carolina   3    7.1   7.03
Louisiana        4    6.6   6.82
Mississippi      5    7.7   6.12
Alabama          6    7.1   6.29
Georgia          7    6.6   6.75
South Carolina   8    8.0   7.14
Missouri         9    7.9   7.12
Kentucky         10   7.5   7.71
West Virginia    11   7.1   8.21
Virginia         12   8.5   7.48
Florida          13   8.0   7.77
On the other hand, the maximum likelihood estimates are much closer to the targets than are the James-Stein estimates for North Carolina and Georgia. For the other seven states individually, neither the James-Stein estimates nor the maximum likelihood estimates are dramatically better. However, when we consider all 13 states simultaneously, the James-Stein estimates emerge as superior. The sum of squared estimation errors — called the loss by statisticians — is

\[ \sum_{j=1}^{13} (\text{target}_j - \text{estimate}_j)^2 = \sum_{j=1}^{13} \left(\mu_j - [0.199 X_j + 0.801 \bar{X}]\right)^2 = 4.677 \]

for the James-Stein estimates, whereas the loss is

\[ \sum_{j=1}^{13} (\text{target}_j - \text{estimate}_j)^2 = \sum_{j=1}^{13} (\mu_j - X_j)^2 = 7.543 \]

for the maximum likelihood estimates. Hence, the James-Stein estimates reduce the loss by 38%.

Besides perceiving whether the maximum likelihood estimates or the James-Stein estimates are closer to the targets, another key to viewing Figure 1 is noticing that the James-Stein estimates for different states are more similar to each other than are the maximum likelihood estimates for different states. This is because the James-Stein estimate for a given state is a compromise between the maximum
likelihood estimate for that state and the average of the maximum likelihood estimates for all of the states:

\[ \text{James-Stein estimate for state } j = 0.199 \times (\text{maximum likelihood estimate for state } j) + 0.801 \times \left(\sum_{k=1}^{13} \text{maximum likelihood estimate for state } k\right)\Big/ 13. \]
Thus, the James-Stein estimate takes the corresponding maximum likelihood estimate and moves it 80.1% of the way to the average of the maximum likelihood estimates for all of the states.

At this juncture, perhaps some readers may think that it is natural to use data from neighboring states to estimate the expected number of deaths from bladder cancer in a given state. Such readers may, without yet being aware of it, already have in mind an empirical Bayes perspective. We will have more to say about that in Section 5. In any event, we conclude this data analysis with three remarks.

First, there is some heterogeneity in the expected numbers of deaths, ranging from 6.12 per 100,000 in Mississippi to 8.21 per 100,000 in West Virginia. Thus, whatever may explain the superior performance of the James-Stein estimates, it is emphatically not that µ1 = µ2 = ⋯ = µ13. Indeed, if it were true that µ1 = µ2 = ⋯ = µ13, then all of the targets would be the same and consideration of the grand mean X̄ would seem obvious. Hence, a more subtle resolution to the paradox is required.

Second, the expected values in inequality (3) should not be overlooked. In particular, inequality (3) is not a promise that the loss will be reduced every time we use the James-Stein estimators. Rather, the long-term average of the loss — if somehow it were possible to repeat the processes of sampling and constructing estimators an unlimited number of times — is promised to be less with the James-Stein estimators than with the maximum likelihood estimators.

Third, even when the loss is reduced dramatically, some of the individual James-Stein estimates may not be as good as the corresponding individual maximum likelihood estimates. We saw this in our data analysis: the individual James-Stein estimates for North Carolina and Georgia were not as good as the individual maximum likelihood estimates for those states.

We now close this section with a few technical details for readers interested in the probability modeling underlying our data analysis. Other readers may proceed to Section 4 without loss of continuity. For our data analysis, we calculated the 0.801 used in the James-Stein estimates as

\[ \frac{(p-3)\tau^2}{\sum_{j=1}^{p} (X_j - \bar{X})^2}, \]
in accord with prescription (5) and the generalization described thereafter. We
approximated τ², the presumed-common variance of X1, X2, ..., X13, as

\[ \sum_{j=1}^{13} X_j (100{,}000/m_j)\Big/ 13, \]
where mj is the number of males living in state j during the year 1990. Websites with addresses {http://www.census.gov/population/cencounts/zz190090.txt}, maintained by the U.S. Census Bureau, provided m1, m2, ..., m13. Note that "zz" is a placeholder for the usual state abbreviation in lower case letters.

The above approximation for τ² is based on a Poisson probability model for the observed number of bladder cancer deaths in each state. If $Y_j := X_j(m_j/100{,}000)$ — that is, the observed number of bladder cancer deaths in state j — has a Poisson distribution with expected value $\mu_j(m_j/100{,}000)$, then the variance of $Y_j$ is also $\mu_j(m_j/100{,}000)$. Hence, the variance of $X_j = Y_j(100{,}000/m_j)$ must be $\mu_j(m_j/100{,}000)(100{,}000/m_j)^2 = \mu_j(100{,}000/m_j)$. We can roughly estimate this as $X_j(100{,}000/m_j)$, and averaging the latter quantity over all 13 states yields the above approximation for τ².
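As an arithmetic check, the following sketch (ours, assuming NumPy) recomputes the two losses from the Table 1 values, taking the published weight 0.801 as given rather than recomputing τ²; any small discrepancy relative to the reported 4.677 reflects the rounding of 0.801:

```python
import numpy as np

# X_j (1990 rates) and mu_j (1991-2000 average rates) from Table 1.
x  = np.array([7.3, 6.6, 7.1, 6.6, 7.7, 7.1, 6.6, 8.0, 7.9, 7.5, 7.1, 8.5, 8.0])
mu = np.array([7.02, 7.32, 7.03, 6.82, 6.12, 6.29, 6.75, 7.14, 7.12,
               7.71, 8.21, 7.48, 7.77])

xbar = x.mean()                     # grand mean of the 13 sample means
js = 0.199 * x + 0.801 * xbar       # James-Stein estimates per prescription (5)

print(np.sum((mu - x) ** 2))        # maximum likelihood loss, about 7.543
print(np.sum((mu - js) ** 2))       # James-Stein loss, about 4.67
```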
4

A GEOMETRIC HEURISTIC
There is an elegant and simple heuristic reasoning underlying Stein's phenomenon, which we now describe. As before, suppose that X1, X2, ..., Xp are sample means from distinct populations with unknown means µ1, µ2, ..., µp. Moreover, assume that $\sigma_1/\sqrt{n_1} = \sigma_2/\sqrt{n_2} = \cdots = \sigma_p/\sqrt{n_p} = 1$ and that X1, X2, ..., Xp are normally distributed. Using elementary algebra, one obtains the identity

\[ (6)\quad \sum_{j=1}^{p} X_j^2 = \sum_{j=1}^{p} (X_j - \mu_j)^2 + \sum_{j=1}^{p} \mu_j^2 + 2\sqrt{\sum_{j=1}^{p} \mu_j^2}\, Z, \]

where $Z := \sum_{j=1}^{p} \mu_j (X_j - \mu_j) \Big/ \sqrt{\sum_{j=1}^{p} \mu_j^2}$. Since X1, X2, ..., Xp are normally distributed, so is Z, with mean 0 and variance 1. Hence, $2\sqrt{\sum_{j=1}^{p} \mu_j^2}\, Z$ is negligible in relation to the larger of p and $\sum_{j=1}^{p} \mu_j^2$ as p becomes large. Moreover, $\sum_{j=1}^{p} (X_j - \mu_j)^2$ equals p plus a remainder that is negligible in relation to p as p becomes large. Thus, identity (6) simplifies to

\[ (7)\quad \sum_{j=1}^{p} X_j^2 \approx p + \sum_{j=1}^{p} \mu_j^2. \]

Approximation (7) holds for all numerical values of µ1, µ2, ..., µp.

We see from approximation (7) that, for large p, a good estimator of $\sum_{j=1}^{p} \mu_j^2$ is $\sum_{j=1}^{p} X_j^2 - p$, assuming that the latter turns out to be nonnegative. A negative
value of $\sum_{j=1}^{p} X_j^2 - p$ is possible, especially when $\sum_{j=1}^{p} \mu_j^2$ is small. If such a contingency arises, we may sensibly replace $\sum_{j=1}^{p} X_j^2 - p$ by 0. However, for simplicity in what follows, we do not consider such a contingency.

The maximum likelihood estimator of $\sum_{j=1}^{p} \mu_j^2$ is $\sum_{j=1}^{p} X_j^2$, since the maximum likelihood estimator of µj, 1 ≤ j ≤ p, is Xj. Thus, the maximum likelihood estimator of $\sum_{j=1}^{p} \mu_j^2$ tends to err on the high side, suggesting that a better estimator of µj, 1 ≤ j ≤ p, is available. In particular, we may consider multiplicatively adjusting Xj, 1 ≤ j ≤ p, by

\[ (8)\quad \sqrt{\frac{\sum_{k=1}^{p} X_k^2 - p}{\sum_{k=1}^{p} X_k^2}} = \sqrt{1 - \frac{p}{\sum_{k=1}^{p} X_k^2}} \approx 1 - \frac{p/2}{\sum_{k=1}^{p} X_k^2}, \]

the last approximation in (8) following from the calculus result $\sqrt{1+x} \approx 1 + x/2$ for small |x|. Such multiplication shrinks the vector $X := (X_1, X_2, \ldots, X_p)'$ toward the origin in p-dimensional space while leaving its direction $X \big/ \sqrt{\sum_{j=1}^{p} X_j^2}$ undisturbed.

Figure 2 illustrates the geometry for p = 2. The joint probability density function for X1 and X2 is greater everywhere inside the dark-and-light circle than anywhere outside it. Roughly speaking, this means that the most likely values of X1 and X2 are those within the dark-and-light circle. However, there is an asymmetry. The region in which $X_1^2 + X_2^2 < \mu_1^2 + \mu_2^2$, colored light, is smaller than the region in which $X_1^2 + X_2^2 > \mu_1^2 + \mu_2^2$, colored dark. This asymmetry arises because the circle centered at the origin with radius $\sqrt{\mu_1^2 + \mu_2^2}$ is a convex set in 2-dimensional space. Hence, $X_1^2 + X_2^2 > \mu_1^2 + \mu_2^2$ is a more probable outcome than $X_1^2 + X_2^2 < \mu_1^2 + \mu_2^2$. Shrinkage of $(X_1, X_2)'$ toward the origin works in our favor when $(X_1, X_2)'$ falls in the dark region and works against us when $(X_1, X_2)'$ falls in the light region, but the former is more likely than the latter. A similar geometry applies for p > 2 because the hypersphere centered at the origin with radius $\sqrt{\sum_{j=1}^{p} \mu_j^2}$ is a convex set in p-dimensional space.
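A quick Monte Carlo check of approximation (7) and of the asymmetry just described (our sketch, assuming NumPy; the dimension p and the true means are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
mu = rng.uniform(-2.0, 2.0, size=p)          # arbitrary true means

x = rng.normal(mu, 1.0, size=(100_000, p))   # many replicate draws of X
sum_sq = (x ** 2).sum(axis=1)

# Approximation (7): sum of X_j^2 concentrates near p + sum of mu_j^2.
print(sum_sq.mean(), p + np.sum(mu ** 2))

# The "dark region" outcome is the more probable one.
print(np.mean(sum_sq > np.sum(mu ** 2)))     # noticeably above 1/2
```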
Figure 3 provides a numerical example in which µ1 = µ2 = 5 and various values of $(X_1, X_2)'$ are identified as belonging either to the dark region in which $X_1^2 + X_2^2 > 50$ or to the light region in which $X_1^2 + X_2^2 < 50$. Of particular interest is that there exist points symmetric about $(5, 5)'$ that land inside the dark region. For instance, $(2.5, 7)'$ and $(7.5, 3)'$ both fall inside the dark region. On the other hand, there do not exist points symmetric about $(5, 5)'$ that fall inside the light region. This indicates that $(X_1, X_2)'$ is more likely to land inside the dark region, which is to say that $X_1^2 + X_2^2 > 50$ is more probable than $X_1^2 + X_2^2 < 50$.

Based on the above geometric heuristic, Stein [1956] provided a rigorous mathematical justification for the existence of better estimators of µ1, µ2, ..., µp than the maximum likelihood estimators X1, X2, ..., Xp. The technical meaning of "better" entails the decision theoretic concepts of loss and risk. For any estimators D1, D2, ..., Dp of µ1, µ2, ..., µp, we define the loss to be the sum of squared estimation errors,
Figure 2. Geometric Heuristic for Shrinkage
\[ (9)\quad \sum_{j=1}^{p} (\mu_j - D_j)^2. \]

(Other definitions of the loss are possible, as will be noted toward the end of this section.) The risk is defined as the expected value of the loss,

\[ (10)\quad E\left[\sum_{j=1}^{p} (\mu_j - D_j)^2\right]. \]
Assembling µ1, µ2, ..., µp into a vector µ and D1, D2, ..., Dp into a vector D, we may use the shorthand L(µ, D) for the loss (9) and the shorthand R(µ, D) for the risk (10). Stein [1956] showed that, for any p ≥ 3, there exists a D such that

\[ (11)\quad R(\mu, D) < R(\mu, X), \]

where, as before, X is a vector containing the maximum likelihood estimators X1, X2, ..., Xp. Importantly, relation (11) holds regardless of what µ may be. James and Stein [1961] subsequently established relation (11) for the specific choice
Figure 3. Geometric Heuristic for Shrinkage (including representative points)

\[ (12)\quad D_j := \left(1 - \frac{p-2}{\sum_{k=1}^{p} X_k^2}\right) X_j, \quad 1 \le j \le p, \]
which entails the kind of shrinkage displayed in (8) except that p/2 has been replaced by p − 2.

Figure 4 illustrates the reduction in risk associated with the estimators (12) for a fixed p ≥ 3. While R(µ, X) = p regardless of µ, R(µ, D) with D as in (12) has the form

\[ p - (p-2)\, b(\mu), \]

where b(µ) is bounded above by 1, equal to 1 when µ is the zero vector, and vanishingly small as $\sqrt{\sum_{j=1}^{p} \mu_j^2}$ becomes large. Thus, the estimators (12) are dramatically better than the maximum likelihood estimators when $\sqrt{\sum_{j=1}^{p} \mu_j^2}$ is small. Intuitively, shrinking the vector of maximum likelihood estimators toward the origin must work in our favor if µ itself is at the origin!
[Figure 4. Comparison of risks with and without shrinkage, p = 5. The plot shows risk against the magnitude of the µ vector for maximum likelihood estimation and for James-Stein estimation.]

The estimators (12) can be modified to pull the vector of maximum likelihood estimators toward any point of the form (C, C, ..., C)′, where C is a real constant or a real-valued random variable. A case of particular interest, already considered in the data analysis of Section 3, is when C equals the grand mean X̄. In this case, (12) is changed to
\[ (13)\quad D_j := \bar{X} + \left(1 - \frac{p-3}{\sum_{k=1}^{p} (X_k - \bar{X})^2}\right)(X_j - \bar{X}), \quad 1 \le j \le p. \]
Also, as was suggested in Section 2, a further modification is possible if $\sigma_1/\sqrt{n_1}, \sigma_2/\sqrt{n_2}, \ldots, \sigma_p/\sqrt{n_p}$ share a common value other than 1.

We close this section by noting that Stein's phenomenon holds under far more general conditions than indicated above. Hudson [1974], Efron and Morris [1976], and Berger [1976; 1985] are among those who have exhibited Stein-type estimators superior to X1, X2, ..., Xp under more general conditions. One generalization allows X1, X2, ..., Xp to have unequal variances and/or to be correlated. Another generalization entails replacing the loss (9) and risk (10) by
\[ (14)\quad L_Q(\mu, D) := (\mu - D)'\, Q\, (\mu - D) = \sum_{j=1}^{p} \sum_{k=1}^{p} q_{jk} (\mu_j - D_j)(\mu_k - D_k) \]

and $R_Q(\mu, D) := E[L_Q(\mu, D)]$, respectively, where Q is a given p × p positive definite matrix whose (j, k) element is $q_{jk}$, 1 ≤ j, k ≤ p. As an example, suppose that X1, X2, ..., Xp have known positive variances $\tau_1^2, \tau_2^2, \ldots, \tau_p^2$ (but are uncorrelated) and that Q is a diagonal matrix (i.e., $q_{jk} = 0$ for j ≠ k). If p ≥ 3, then better estimators of µ1, µ2, ..., µp than X1, X2, ..., Xp are given by

\[ D_j := \left(1 - \frac{p-2}{\tau_j^2\, q_{jj} \sum_{k=1}^{p} X_k^2/(\tau_k^4\, q_{kk})}\right) X_j, \quad 1 \le j \le p. \]

They simplify to the James-Stein estimators in (12) when $\tau_1^2 = \tau_2^2 = \cdots = \tau_p^2 = 1$ and $q_{11} = q_{22} = \cdots = q_{pp} = 1$.
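Relation (11) can also be seen numerically by estimating both risks with simulation. A minimal sketch (ours, assuming NumPy; p, the µ vector, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, reps = 5, 200_000
mu = np.full(p, 0.5)                           # arbitrary true mean vector

x = rng.normal(mu, 1.0, size=(reps, p))        # X ~ N(mu, I)
s = (x ** 2).sum(axis=1, keepdims=True)
d = (1.0 - (p - 2) / s) * x                    # James-Stein estimator (12)

print(((x - mu) ** 2).sum(axis=1).mean())      # risk of MLE: close to p
print(((d - mu) ** 2).sum(axis=1).mean())      # smaller, illustrating (11)
```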
5

AN EMPIRICAL BAYES PERSPECTIVE
The fundamental ideas underlying Stein’s phenomenon are by no means restricted to a classical frequentist perspective, in which µ1 , µ2 , . . . , µp are treated as unknown constants. Indeed, the logic behind Stein’s phenomenon transcends different schools of statistical thought. Efron and Morris [1971; 1972; 1973; 1975; 1976] and Berger [1982; 1985] are among those who have examined Stein’s phenomenon from a Bayesian or empirical Bayesian perspective. Such a perspective entails treating µ1 , µ2 , . . . , µp as random quantities with their own distributions. A further distinction must be made between prior distributions, which pertain before X1 , X2 , . . . , Xp have been observed, and posterior distributions, which pertain after X1 , X2 , . . . , Xp have been observed. An elegant and intuitive justification of Stein’s phenomenon emerges, as described next.
Suppose that each µj, 1 ≤ j ≤ p, has the same normal prior distribution with mean θ and variance δ². The values of θ and δ² reflect an individual's subjective view — before X1, X2, ..., Xp have been observed — about what numerical values are likely for µ1, µ2, ..., µp. Suppose also that the distribution of Xj given µj, 1 ≤ j ≤ p, is normal with mean µj and variance τ². Then, once X1, X2, ..., Xp have been observed, a famous result called Bayes' Theorem may be applied to derive the posterior distributions of µ1, µ2, ..., µp. In fact, the posterior distribution of µj, 1 ≤ j ≤ p, is normal [Gelman et al., 1995] with mean

\[ (15)\quad \theta + \left(1 - \frac{1}{1 + \delta^2/\tau^2}\right)(X_j - \theta) \]

and variance

\[ \frac{1}{1/\tau^2 + 1/\delta^2}. \]
which simplifies to (13) when τ 2 = 1. To distinguish (15) from (16), we refer to the former as a Bayesian estimator and to the latter as an empirical Bayesian estimator. The empirical Bayesian estimator reflects the idea, also exemplifed by Robbins’ [1951] quotation, of borrowing strength from unrelated data. Here the data X1 , X2 , . . . , Xj−1 , Xj+1 , . . . , Xp are apparently unrelated to Xj , yet they provide strength in the form of a reduced risk in estimating µj . —idexcancer We conclude this section with two remarks. First, by adopting an empirical Bayesian perspective, Efron and Morris [1973] were able to give a short and easily accessible proof for the superiority of the James-Stein estimators (12) over maximum likelihood estimators. Second, some further interpretation is now possible for the data analysis example of Section 3, as explained below. If µ1 , µ2 , . . . , µ13 share a common prior mean θ that we treat as an unknown constant, then every observation X1 , X2 , . . . , X13 contains information about the value of θ. Hence, although X1 contains the most information about µ1 , X2 , . . . , X13 also contain some information about µ1 . Analogous statements apply to µ2 , . . . , µ13 . In essence, θ unites what otherwise may appear to be 13 distinct estimation problems. From a subject matter viewpoint, we may conceptualize θ as the expected annual deaths from bladder cancer per 100,000 males over the whole 13-state region.
This makes highly plausible the idea that using every observation to estimate µ1 should be better than using only X1, using every observation to estimate µ2 should be better than using only X2, and so forth. However, while conceptualizing θ from a subject matter viewpoint makes Stein's phenomenon highly plausible in this example, in general the presence of Stein's phenomenon does not depend on the existence of — let alone our recognition of — such a conceptualization.
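A sketch of the empirical Bayes computation (16) (ours, assuming NumPy; τ² is treated as known and the data vector is a placeholder):

```python
import numpy as np

def empirical_bayes(x, tau2=1.0):
    # Estimator (16): plug-in choices for theta and delta^2 in (15).
    p = x.size
    xbar = x.mean()                         # plug-in prior mean theta
    shrink = 1.0 - (p - 3) * tau2 / np.sum((x - xbar) ** 2)
    return xbar + shrink * (x - xbar)

x = np.array([7.3, 6.6, 7.1, 6.6, 7.7, 7.1, 6.6, 8.0])   # placeholder data
print(empirical_bayes(x))
```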
6

A REGRESSION PERSPECTIVE
The classical idea of regression as described by Galton [1890] provides another intuitive explanation for Stein's phenomenon. The excellent article by Stigler [1990] discusses Stein's phenomenon from this perspective, rendering transparent the justification for estimators like (13). We follow Stigler [1990] closely in the presentation below.

As in Section 2, we have pairs (X1, µ1), (X2, µ2), ..., (Xp, µp) such that X1, X2, ..., Xp are observable but µ1, µ2, ..., µp are not. We assume that X1, X2, ..., Xp are normally distributed with respective means µ1, µ2, ..., µp and common variance 1. As such, we may view X1, X2, ..., Xp as generated from the model

\[ (17)\quad X_j = \mu_j + \epsilon_j \quad \text{for } 1 \le j \le p, \]

where $\epsilon_1, \epsilon_2, \ldots, \epsilon_p$ are normally distributed "error" terms with common mean 0 and common variance 1.

Even though µ1, µ2, ..., µp are unknown, we can conceptualize a scatterplot of µ1, µ2, ..., µp against X1, X2, ..., Xp. This is illustrated for simulated data in Figure 5. Now imagine fitting two regressions to the points in the scatterplot: a regression of X1, X2, ..., Xp on µ1, µ2, ..., µp symbolized as $R_1(X|\mu) = \alpha_1 + \beta_1 \mu$, and a regression of µ1, µ2, ..., µp on X1, X2, ..., Xp symbolized as $R_2(\mu|X) = \alpha_2 + \beta_2 X$. Not surprisingly given (17), $R_1(X|\mu)$ turns out to be the 45-degree line: $\alpha_1 = 0$ and $\beta_1 = 1$. Figure 5 shows $R_1(X|\mu)$ in red, from which we see that $R_1(X|\mu)$ avoids large horizontal deviations by the points in the scatterplot. Note that maximum likelihood estimation of µ1, µ2, ..., µp is, in effect, using $R_1(X|\mu)$ to predict µ from X — even though the form of $R_1(X|\mu)$ indicates prediction of X from µ!

To predict µ from X, a more appropriate option than using $R_1(X|\mu)$ is to employ $R_2(\mu|X)$. Geometrically, we want to avoid large vertical deviations by the points in the scatterplot. Appealing to least squares, we may estimate β2 and α2 by

\[ (18)\quad \hat{\beta}_2 := \frac{\sum_{j=1}^{p} (X_j - \bar{X})(\mu_j - \bar{\mu})}{\sum_{j=1}^{p} (X_j - \bar{X})^2} \quad \text{and} \quad \hat{\alpha}_2 := \bar{\mu} - \hat{\beta}_2 \bar{X}, \]

where $\bar{\mu} := \sum_{j=1}^{p} \mu_j/p$. Figure 5 shows $R_2(\mu|X)$ in blue, with α2 and β2 estimated from the points in the scatterplot per (18). The blue line in Figure 5 illustrates Galton's [1890] phenomenon of regression to the mean: extremely low values of X suggest low — but not as extremely low — values of µ, while extremely high values of X suggest high — but not as extremely high — values of µ.
[Figure 5. Stein estimation from a regression perspective. The scatterplot shows the unobservable µ's against the observable X's for simulated data, with one line regressing the X's on the µ's and another regressing the µ's on the X's.]

This idea is further illustrated in Figure 6, which pertains to the bladder cancer data from Table 1. With these real data, we find that $R_1(X|\mu)$ has greater than unit slope. So, in fact, predicting µ from X based on $R_1(X|\mu)$ is not merely counterintuitive but also outlandish. For instance, a one-year mortality rate of 8.5 (per 100,000) would lead to a predicted ten-year mortality rate of 12.2, whereas a one-year mortality rate of 7.5 would lead to a predicted ten-year mortality rate of 7.7. On the other hand, predicting µ from X based on $R_2(\mu|X)$ seems quite reasonable. Of course, we took liberties here for the sake of illustration: the practitioner cannot use µ1, µ2, ..., µp to estimate α2 and β2 because the practitioner does not know µ1, µ2, ..., µp.

Stigler [1990] argued that $\hat{\beta}_2$ and $\hat{\alpha}_2$ in (18) may be reasonably replaced by

\[ \tilde{\beta}_2 := 1 - \frac{p-1}{\sum_{k=1}^{p} (X_k - \bar{X})^2} \quad \text{and} \quad \tilde{\alpha}_2 := \bar{X} - \tilde{\beta}_2 \bar{X}. \]

Then, if we estimate µj for 1 ≤ j ≤ p by $\tilde{\alpha}_2 + \tilde{\beta}_2 X_j$, we obtain
[Figure 6. Stein estimation from a regression perspective (bladder cancer data). The scatterplot shows µ_1, µ_2, ..., µ_13 against X_1, X_2, ..., X_13, again with one line regressing the X's on the µ's and another regressing the µ's on the X's.]
\[ (19)\quad \bar{X} + \left(1 - \frac{p-1}{\sum_{k=1}^{p} (X_k - \bar{X})^2}\right)(X_j - \bar{X}). \]
Expression (19) is identical to expression (13), except that p − 3 has been replaced by p − 1. The estimators in (19) are superior to maximum likelihood estimators, in that they have a lower risk (10), as long as p ≥ 6. We remark that Meng [2005] recovered the estimators (13), with the p − 3 intact, by modifying Stigler's [1990] derivation. Meng's [2005] approach employed a special formula relating the expected value of a ratio of random variables to a statistical entity called the moment generating function.
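The regression-perspective estimator (19) is equally easy to express in code. A minimal sketch (ours, assuming NumPy; the data values are placeholders):

```python
import numpy as np

def stigler_estimate(x):
    # Estimator (19): alpha~ + beta~ * X_j, which equals
    # Xbar + beta~ * (X_j - Xbar) after substituting alpha~ = Xbar - beta~ * Xbar.
    p = x.size
    xbar = x.mean()
    beta = 1.0 - (p - 1) / np.sum((x - xbar) ** 2)
    return xbar + beta * (x - xbar)

x = np.array([7.3, 6.6, 7.1, 6.6, 7.7, 7.1, 6.6, 8.0, 7.9, 7.5])
print(stigler_estimate(x))
```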
7

RELATED METHODOLOGICAL DEVELOPMENTS
The impact of Stein’s phenomenon on statistical science for multidimensional problems has been striking. Indeed, Stein’s phenomenon has motivated numerous methodological and theoretical advances, some of which are taking place even now — more than a half-century after Stein’s crucial insight! This section briefly highlights some such advances, mostly in the context of a normal probability model like that described at the beginning of Section 4. We
emphasize that this section is by no means an exhaustive account of the advances that have been inspired by Stein's phenomenon. In particular, much work has also been done outside the context of a normal probability model.

To clarify the impetus for methodological and theoretical developments following the revelation of Stein's phenomenon, we reiterate that a James-Stein estimator such as (4) or (5) performs admirably in reducing the risk (10) in the simultaneous estimation of µ1, µ2, ..., µp. Yet, an individual mean µj (1 ≤ j ≤ p) may be grossly misestimated by (4) or (5), particularly if µj is not close to most of the other means µ1, µ2, ..., µ_{j−1}, µ_{j+1}, ..., µp. Considering again the data analysis example in Section 3, the expected annual deaths per 100,000 males for Georgia was only 6.75. This was smaller than for all but two of the other states, so the maximum likelihood estimate of 6.60 for Georgia was much better than the James-Stein estimate of 7.23.

To emphasize the distinction between reducing $E\left[\sum_{j=1}^{p} (\mu_j - D_j)^2\right]$ and controlling $E\left[(\mu_j - D_j)^2\right]$ for a single specific j (1 ≤ j ≤ p), statisticians sometimes refer to the former as the "ensemble risk" and the latter as a "component risk". Unlike the ensemble risk, a component risk can be higher with a James-Stein estimator than with the maximum likelihood estimator. This might have discouraged some data analysts from using a James-Stein estimator in practical applications. Recognizing this, Efron and Morris [1971; 1972; 1973; 1975; 1976] made significant methodological and theoretical advances that rendered Stein-type estimators more attractive and accessible to practitioners:

1. They proposed "limited translation estimators" expressible in the form

\[ \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\,\rho_D\!\left(\frac{(p-2)X_1^2}{\sum_{j=1}^{p} X_j^2}\right)\right) X_1, \quad \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\,\rho_D\!\left(\frac{(p-2)X_2^2}{\sum_{j=1}^{p} X_j^2}\right)\right) X_2, \quad \ldots, \quad \left(1 - \frac{p-2}{\sum_{j=1}^{p} X_j^2}\,\rho_D\!\left(\frac{(p-2)X_p^2}{\sum_{j=1}^{p} X_j^2}\right)\right) X_p \]
with $\rho_D(u) := \min\{1, D/u^{1/2}\}$ for a specified D > 0. The practitioner can choose D to avoid deviating "too much" from the maximum likelihood estimator, and thereby avoid having too large a component risk, while still gaining some reduction of the ensemble risk. In particular, taking D smaller exercises more control over the component risks, while taking D larger achieves greater reduction of the ensemble risk. If desired, these limited translation estimators can be modified to shrink X1, X2, ..., Xp toward X̄ (or to any specific nonzero constant) rather than to 0.
2. They described how to accommodate the more realistic scenario in which X1, X2, ..., Xp have unknown rather than known variances.

3. They articulated Bayesian and empirical Bayesian justifications for Stein-type estimators, providing convincing arguments for the use of Stein-type estimators in real life data analysis problems.

Following Efron and Morris, other authors developed numerous variants of Stein-type estimators to provide further appeal to practitioners and to accommodate loss functions of the form (14) rather than (9) exclusively. Berger [1980; 1985] presents an excellent account of these developments.

An important methodological advance based on Stein's phenomenon is the creation of improved confidence sets. A confidence set is a data-dependent region of the parameter space that has a guaranteed probability, called the "confidence level", of containing the true value of the unknown parameter describing how the data are generated. A student in an introductory statistics course learns that a confidence interval — a special kind of confidence set when the unknown parameter is one-dimensional — quantifies uncertainty in the estimation of the unknown parameter and that, moreover, there is a duality between confidence intervals and hypothesis tests. Here duality refers to a general correspondence between a significance level α test of a null hypothesis concerning the unknown parameter and a confidence interval for the same unknown parameter that has confidence level 100(1 − α)%. More specifically, the 100(1 − α)% confidence interval consists precisely of those candidate values for the parameter that would not be rejected by the significance level α test.

For example, suppose that we have a single observation X available from a normal distribution with unknown mean µ and known standard deviation 1. Consider testing the null hypothesis that µ equals the candidate value µ0. The test rejects the null hypothesis at significance level 0.05 if and only if |X − µ0| > 1.96, which is to say that the null hypothesis is not rejected if and only if X is in the interval µ0 ± 1.96. With the help of a little algebra, we see that X falling in the interval µ0 ± 1.96 is the same as µ0 belonging to the interval X ± 1.96. Hence, X ± 1.96 defines a 95% confidence interval for µ. This process can also be reversed, to obtain a hypothesis test from a confidence interval.

As the above example suggests, duality facilitates the use of confidence intervals in drawing inferences about the unknown parameter. Moreover, a similar duality pertains to hypothesis tests and confidence sets when the unknown parameter is multi-dimensional, so that developing better confidence sets is indeed a step forward in statistical science.

Regarding the normal probability model with which we began Section 4, a conventional confidence set for the vector µ := (µ1, µ2, ..., µp)′ is a p-dimensional sphere centered at the maximum likelihood estimator X := (X1, X2, ..., Xp)′,

\[ C_0(X) := \left\{\mu : \sum_{j=1}^{p} (\mu_j - X_j)^2 \le m^2\right\}. \]
The radius m > 0 of the p-dimensional sphere is chosen to achieve a desired confidence level. In the special case that p = 1, the sphere is just an interval and $C_0(X)$ has the form

\[ (20)\quad \text{estimate} \pm \text{margin of error}, \]

where the radius is identified with the margin of error. Although formula (20) does not apply when p > 1, the radius is still referred to as the margin of error even when p > 1.

Motivated by his work leading to the James-Stein estimator, Stein [1962] initiated the search for improved confidence sets. Based on approximate calculations, he conjectured that there should exist a confidence set with higher confidence level than $C_0(X)$ but with equal or lesser margin of error when p ≥ 3. This conjecture was settled affirmatively by Brown [1966] and Joshi [1967], although their arguments did not lend themselves to the explicit construction of the better confidence set. Employing Bayesian reasoning, Faith [1976] and Berger [1980] derived confidence sets that improved upon $C_0(X)$. However, these confidence sets were not directly related to a Stein-type estimator. On the other hand, Hwang and Casella [1982] explicitly appealed to a Stein-type estimator in deriving a confidence set superior to $C_0(X)$. The improved confidence set is

\[ C_{HC}(X) := \left\{\mu : \sum_{j=1}^{p} (\mu_j - \tilde{\mu}_j)^2 \le m^2\right\}, \]
where

\[ \tilde{\mu}_1 := \left[1 - \frac{a}{\sum_{j=1}^{p} X_j^2}\right]^{+} X_1, \quad \tilde{\mu}_2 := \left[1 - \frac{a}{\sum_{j=1}^{p} X_j^2}\right]^{+} X_2, \quad \ldots, \quad \tilde{\mu}_p := \left[1 - \frac{a}{\sum_{j=1}^{p} X_j^2}\right]^{+} X_p \]
for an appropriately chosen positive constant a. Above, $u^{+} := \max\{0, u\}$ denotes the "positive part" of u. The confidence set $C_{HC}(X)$ achieves a higher confidence level than $C_0(X)$ when p ≥ 4.

Tseng and Brown [1997] and Samworth [2005] subsequently constructed confidence sets with the same confidence level as $C_0(X)$ but strictly smaller margin of error. These confidence sets are potentially more appealing to data analysts, as many practitioners prefer to fix the confidence level and seek a smaller margin of error rather than fix the margin of error and seek a larger confidence level. Efron [2006] recently presented a very elegant and general method of constructing confidence sets with smaller margin of error than $C_0(X)$ and quantified the extent to which the margin of error could be reduced. The aforementioned duality between hypothesis tests and confidence sets was part of Efron's derivation.
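To illustrate, here is a sketch of a membership check for $C_{HC}(X)$ (ours, assuming NumPy; the constant a and the radius m must be calibrated separately to achieve the desired confidence level and are placeholders here):

```python
import numpy as np

def positive_part_center(x, a):
    # The positive-part shrinkage estimates at the center of C_HC(X).
    return max(0.0, 1.0 - a / np.sum(x ** 2)) * x

def in_chc(mu, x, a, m):
    # True if the candidate mu lies in the Hwang-Casella set C_HC(X).
    return np.sum((mu - positive_part_center(x, a)) ** 2) <= m ** 2

x  = np.array([1.0, -0.8, 0.5, 2.1])    # placeholder observations, p = 4
mu = np.zeros(4)                        # candidate parameter value
print(in_chc(mu, x, a=2.0, m=3.0))      # placeholder a and m
```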
Another key methodological advance related to Stein's phenomenon involves estimation of the "covariance matrix" Σ or the corresponding "precision matrix" $\Sigma^{-1}$, where the superscript −1 denotes the operation of matrix inversion. We refer to Σ as the covariance matrix because its (j, k) element (1 ≤ j, k ≤ p) is the covariance between Xj and Xk. (If j = k, then this is just the variance of Xj.) The conventional estimator of Σ is

\[ \hat{\Sigma}_0 := S, \]

whose (j, k) element (1 ≤ j, k ≤ p) is the sample analogue to the (j, k) element of Σ in much the same way that Xj (1 ≤ j ≤ p) is the sample analogue to µj. The corresponding conventional estimator of $\Sigma^{-1}$ is

\[ \widehat{\Sigma^{-1}}_0 := K S^{-1}, \]

where K is a positive constant depending on p and the sample sizes n1, n2, ..., np.

Surprisingly, a Stein-type phenomenon holds in the estimation of Σ and $\Sigma^{-1}$, even though the quantities being estimated are matrices. Regarding $\Sigma^{-1}$, Efron and Morris [1976] derived a Stein-type estimator that expands, rather than shrinks, the conventional estimator $K S^{-1}$. This estimator is

\[ \widehat{\Sigma^{-1}}_{EM} := K S^{-1} + \frac{p^2 + p - 2}{\mathrm{tr}(S)}\, I_{p \times p}, \]

where tr represents the trace of a matrix (sum of the diagonal elements) and $I_{p \times p}$ is the identity matrix (diagonal elements equal to 1, non-diagonal elements equal to 0) with p rows and p columns. Efron and Morris [1976] showed that $\widehat{\Sigma^{-1}}_{EM}$ has smaller risk than $\widehat{\Sigma^{-1}}_0$. In this setting, the risk of an estimator $\widehat{\Sigma^{-1}}$ is defined, up to a constant of proportionality, as

\[ (21)\quad E\left[\frac{\mathrm{tr}\big((\widehat{\Sigma^{-1}} - \Sigma^{-1})^2\, S\big)}{\mathrm{tr}(\Sigma^{-1})}\right]. \]

The loss underlying (21) is not as transparent as (9). However, the loss underlying (21) has the quadratic-like feature of depending on the squared difference between the estimator $\widehat{\Sigma^{-1}}$ and the target $\Sigma^{-1}$.

On the other hand, Stein-type estimators for Σ do entail a sort of shrinkage. In particular, quantities called the "eigenvalues" of S are pulled toward each other, in much the same way that the James-Stein estimator (5) pulls X1, X2, ..., Xp toward each other. Dey and Srinivasan [1985] constructed a wide class of Stein-type estimators with smaller risk than the conventional estimator S. In this setting, the risk of an estimator $\hat{\Sigma}$ is defined as
\[ (22)\quad E\left[\mathrm{tr}(\hat{\Sigma}\Sigma^{-1}) - \log\{\det(\hat{\Sigma}\Sigma^{-1})\} - p\right], \]

where det is the determinant of a matrix. The loss underlying (22) is referred to as "entropy loss".

Finally, a major theoretical development associated with Stein's phenomenon is the characterization of estimators of µ that can be improved upon when quadratic loss underlies the risk. Brown [1971] and Srinivasan [1981] provided a complete characterization of such estimators.
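A sketch of the Efron-Morris expansion estimator for the precision matrix (ours, assuming NumPy; the constant K depends on the sampling setup and is simply passed in, and the sample covariance matrix is simulated):

```python
import numpy as np

def efron_morris_precision(s, k):
    # Expanded estimator: K S^{-1} + ((p^2 + p - 2)/tr(S)) I.
    p = s.shape[0]
    return k * np.linalg.inv(s) + ((p ** 2 + p - 2) / np.trace(s)) * np.eye(p)

rng = np.random.default_rng(2)
data = rng.normal(size=(40, 5))         # 40 observations on 5 variables
s = np.cov(data, rowvar=False)          # placeholder sample covariance S
print(efron_morris_precision(s, k=1.0))
```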
8
STEIN’S PHENOMENON AND MODERN SCIENCE
We conclude this article with a brief discussion of how Stein's phenomenon is helping to shape modern science. As with the last section, we emphasize that the present discussion does not provide an exhaustive account. We offer some general remarks and then consider one specific illustration in detail.

Many of the challenging statistical problems in modern science relate to the dimensionality of the data, specifically a large ensemble size p accompanied by comparatively small sample sizes n1, n2, ..., np. These are often referred to as "large p, small n" scenarios. Such scenarios arise in numerous fields, including molecular biology and genetics, finance, econometrics, computer science, and high energy physics. A prototypical example is microarray data analysis, in which expression data are obtained for each of p genes on n = n1 = n2 = ⋯ = np subjects with p ≫ n. Classical statistical methods fare poorly in this "large p, small n" scenario, which has inspired a massive body of literature on microarray data analysis; see Dai and Charnigo [2008] and references therein for just a few of the many works. Some of the more recent literature on microarray data analysis employs Stein's phenomenon. For instance, Schafer and Strimmer [2005] developed a Stein-type covariance matrix estimator to map gene associations. Their application reinforces the idea that Stein's phenomenon is not limited to measures of central tendency.

An interesting illustration of Stein's phenomenon in modern science, which we consider in some detail, arises in molecular systems biology. Here a major goal is to characterize the interplay among genes so that molecular mechanisms of diseases and associated cellular functions can be identified and described. Fulfillment of this goal requires estimating associations between pairs of genes and mapping the interplay among genes as a network. A general model for the association between two genes with expression levels X and Y entails a quantity M(X, Y) called "mutual information". The mutual information is related to the statistical concept of "entropy". When X and Y are jointly normally distributed, the mutual information is a measure of linear association that is closely related to the familiar Pearson correlation. If we discretize the expression levels into p categories and model X and Y with
multinomial distributions, then the mutual information is given by

\[ M(X, Y) := H(X) + H(Y) - H(X, Y), \]

where $H(X) := -\sum_{j=1}^{p} \theta_j \log \theta_j$ and θ1, θ2, ..., θp are the probabilities that X falls within the respective categories. We may define H(Y) and H(X, Y) analogously, noting that there are p probabilities underlying H(Y) and p × p probabilities underlying H(X, Y). Estimating M(X, Y) is a simple matter if the underlying probabilities can be recovered. Let θ be a vector containing these probabilities. The conventional maximum likelihood estimator $\hat{\theta}^{ML}$ does not work well when p is large relative to the sample sizes. Hausser and Strimmer [2008] therefore developed a Stein-type estimator of the form

\[ \hat{\theta}^{*} := \lambda\, \hat{\theta}^{Target} + (1 - \lambda)\, \hat{\theta}^{ML}, \]

where $\hat{\theta}^{Target}$ is a "target estimator" obtained by reducing the dimension of the data and λ is determined adaptively from the data. Hausser and Strimmer [2008] provided empirical evidence for the superiority of $\hat{\theta}^{*}$ in recovering M(X, Y). Furthermore, when they applied $\hat{\theta}^{*}$ in the analysis of E. Coli stress response data from 102 genes, they found three new significant associations that had not been detected with other methods.
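The following sketch illustrates shrinkage of multinomial cell probabilities of this general kind, using the uniform distribution as a stand-in target (ours, assuming NumPy; Hausser and Strimmer derive λ analytically from the data, whereas here λ is simply passed in):

```python
import numpy as np

def shrinkage_probs(counts, lam):
    # Shrink the maximum likelihood cell probabilities toward a uniform target.
    theta_ml = counts / counts.sum()
    theta_target = np.full_like(theta_ml, 1.0 / counts.size)
    return lam * theta_target + (1.0 - lam) * theta_ml

def plugin_entropy(theta):
    # H = -sum theta_j log theta_j, with 0 log 0 treated as 0.
    nz = theta[theta > 0]
    return -np.sum(nz * np.log(nz))

counts = np.array([5.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0])  # sparse counts
print(plugin_entropy(shrinkage_probs(counts, lam=0.3)))
```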
ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant No. DMS-0706857. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank Prasanta Bandyopadhyay and an anonymous Reviewer for their comments on an earlier version.

BIBLIOGRAPHY

[Berger, 1976] J. Berger. Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Annals of Statistics, 4, 223-226, 1976.
[Berger, 1980] J. Berger. A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Annals of Statistics, 8, 716-751, 1980.
[Berger, 1982] J. Berger. Bayesian robustness and the Stein effect. Journal of the American Statistical Association, 77, 358-368, 1982.
[Berger, 1985] J. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1985.
[Blyth, 1951] C. Blyth. On minimax statistical decision problems and their admissibility. Annals of Mathematical Statistics, 22, 22-42, 1951.
[Brown, 1966] L. D. Brown. On the admissibility of invariant estimators of one or more location parameters. Annals of Mathematical Statistics, 37, 1087-1136, 1966.
[Brown, 1971] L. D. Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics, 42, 855-903, 1971.
[Cox and Hinkley, 1974] D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman & Hall/CRC, Boca Raton, 1974.
[Dai and Charnigo, 2008] H. Dai and R. Charnigo. Omnibus testing and gene filtration in microarray data analysis. Journal of Applied Statistics, 35, 31-47, 2008.
[Dey and Srinivasan, 1985] D. Dey and C. Srinivasan. Estimation of covariance matrix under Stein's loss. Annals of Statistics, 13, 1581-1591, 1985.
[Efron, 2006] B. Efron. Minimum volume confidence regions for a multivariate normal mean vector. Journal of the Royal Statistical Society Series B, 68, 655-670, 2006.
[Efron and Morris, 1971] B. Efron and C. Morris. Limiting the risk of Bayes and empirical Bayes estimators — part I: the Bayes case. Journal of the American Statistical Association, 66, 807-815, 1971.
[Efron and Morris, 1972] B. Efron and C. Morris. Limiting the risk of Bayes and empirical Bayes estimators — part II: the empirical Bayes case. Journal of the American Statistical Association, 67, 130-139, 1972.
[Efron and Morris, 1973] B. Efron and C. Morris. Stein's estimation rule and its competitors — an empirical Bayes approach. Journal of the American Statistical Association, 68, 117-130, 1973.
[Efron and Morris, 1975] B. Efron and C. Morris. Data analysis using Stein's estimator and its competitors. Journal of the American Statistical Association, 70, 311-319, 1975.
[Efron and Morris, 1976] B. Efron and C. Morris. Multivariate empirical Bayes and estimation of covariance matrices. Annals of Statistics, 4, 22-32, 1976.
[Faith, 1976] R. E. Faith. Minimax Bayes set and point estimators of a multivariate normal mean. Technical Report 66, University of Michigan, 1976.
[Fisher, 1925] R. Fisher. Theory of statistical estimation. In Proceedings of the Cambridge Philosophical Society, 22, 700-725, 1925.
[Galton, 1890] F. Galton. Kinship and correlation. North American Review, 150, 419-431, 1890.
[Gauss, 1821] K. Gauss. Theory on the combination of observations least subject to errors (Part One), 1821. Translation in same-titled book by G. Stewart (1995), Society for Industrial and Applied Mathematics.
[Gauss, 1823] K. Gauss. Theory on the combination of observations least subject to errors (Part Two), 1823. Translation in same-titled book by G. Stewart (1995), Society for Industrial and Applied Mathematics.
[Gauss, 1826] K. Gauss. Theory on the combination of observations least subject to errors (Supplement), 1826. Translation in same-titled book by G. Stewart (1995), Society for Industrial and Applied Mathematics.
[Gelman et al., 1995] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, 1995.
[Hausser and Strimmer, 2008] J. Hausser and K. Strimmer. Entropy inference and the James-Stein estimator, with applications to nonlinear gene association networks. Technical report, arXiv:0811.3579v2 [stat.ML], 31 Dec 2008.
[Hodges and Lehmann, 1950] J. Hodges and E. Lehmann. Some problems in minimax point estimation. Annals of Mathematical Statistics, 21, 182-197, 1950.
[Hudson, 1974] M. Hudson. Empirical Bayes estimation. Technical Report 58, Stanford University, 1974.
[Hwang and Casella, 1982] J. Hwang and G. Casella. Minimax confidence sets for the mean of a multivariate normal distribution. Annals of Statistics, 10, 868-881, 1982.
[James and Stein, 1961] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (J. Neyman, ed.), 1, 361-379. University of California Press, 1961.
[Joshi, 1967] V. Joshi. Confidence intervals for the mean of a finite population. Annals of Mathematical Statistics, 38, 1180-1207, 1967.
[LeCam, 1953] L. LeCam. On some asymptotic properties of maximum likelihood estimates and related Bayes' estimates. University of California Publications in Statistics, 1, 277-329, 1953.
[Meng, 2005] X. Meng. From unit root to Stein's estimator to Fisher's k statistics: if you have a moment, I can tell you more. Statistical Science, 20, 141-162, 2005.
[Neyman, 1937] J. Neyman. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London, Series A, Mathematical and Physical Sciences, 236, 333-380, 1937.
[Rao, 1945] C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81-91, 1945.
[Rao, 1967] C. R. Rao. Least squares theory using an estimated dispersion matrix and its application to measurement of signals. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (L. LeCam and J. Neyman, eds.), 1, 355-372. University of California Press, 1967.
[Robbins, 1951] H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (J. Neyman, ed.), 131-148. University of California Press, 1951.
[Royall, 1997] R. Royall. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall/CRC, London, 1997.
[Samworth, 2005] R. Samworth. Small confidence sets for the mean of a spherically symmetric distribution. Journal of the Royal Statistical Society Series B, 67, 343-361, 2005.
[Schafer and Strimmer, 2005] J. Schafer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, Issue 1, Article 32, 2005.
[Srinivasan, 1981] C. Srinivasan. Admissible generalized Bayes estimators and exterior boundary value problems. Sankhya Series A, 43, 1-25, 1981.
[Stein, 1956] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (J. Neyman, ed.), 1, 197-206. University of California Press, 1956.
[Stein, 1962] C. Stein. Confidence sets for the mean of a multivariate normal distribution (with discussion). Journal of the Royal Statistical Society Series B, 24, 265-296, 1962.
[Stigler, 1990] S. Stigler. The 1988 Neyman memorial lecture: a Galtonian perspective on shrinkage estimators. Statistical Science, 5, 147-155, 1990.
[Tseng and Brown, 1997] Y. L. Tseng and L. D. Brown. Good exact confidence sets for a multivariate normal mean. Annals of Statistics, 25, 2228-2258, 1997.
[Wald, 1949] A. Wald. Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20, 595-601, 1949.
DATA, DATA, EVERYWHERE: STATISTICAL ISSUES IN DATA MINING

Choh Man Teng
1 OCEANS OF DATA
Data used to be hard to come by and even harder to analyze in any large-scale fashion. Advances in collection and storage capabilities have made it relatively convenient to produce and accumulate large volumes of data, often automatically archiving them in data repositories. NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) onboard the Terra spacecraft, for instance, generates 850 gigabytes of data per day [MODIS, 1999]. The number of nucleotide bases in GenBank, the National Center for Biotechnology Information's genetic sequence database, has been doubling approximately every 18 months; a recent release contains more than 110 billion bases [GenBank, 2010].

On the other hand, data are still scarce in some domains. Microarray data, for example, are typically high dimensional and have a high variable-to-case ratio. It is not uncommon to have more variables than cases in a data sample. The problem space, however, is still large, as the space is determined not by the sample size but by the number of variables. This situation further complicates the analysis process, as the amount of data often does not provide sufficient statistical support for many analysis methods.

The amount of data available, combined with the number of variables that need to be considered, is of a scale far beyond what is amenable to manual inspection. Automated and semi-automated data analysis is thus essential to sift through the data for meaningful conclusions. This process is variously called data mining, knowledge discovery, machine learning, or inductive inference, among other names, depending on the discipline. We will use all of these terms, depending on the aspect of the analysis process we would like to emphasize.

Below we will examine automated data analysis and some of its associated issues. Many of these issues were inherent in traditional statistical analysis before automation as well, but the necessary reliance on computerized methods applied to large databases magnifies their import. Overviews and foundational issues regarding statistics in data mining and knowledge discovery are discussed in, for example, [Elder and Pregibon, 1996; Fayyad et al., 1996; Glymour et al., 1997; Hand, 2000; Breiman, 2001].
2 KNOWLEDGE DISCOVERY FROM DATA
Knowledge discovery from data (KDD) aims to extract useful information from large amounts of data. The "knowledge" to be discovered can be broadly grouped into three not mutually exclusive veins, depending on the use to which the extracted information is to be put.
Description

For description, the goal is to provide an intensional summary of the extensional data given in the form of individual database records. For descriptive tasks, intelligibility is an important consideration. The target set of data points is to be summarized in a succinct and easily understandable way, such that its characteristic features can be readily grasped. Association rule mining [Agrawal et al., 1993] is an example of a descriptive task. Description is not in itself inductive, however. There is no uncertainty involved in the results, and good descriptions of a database do not necessarily possess inferential power for generalizing to data outside of the described target database.
Prediction

For prediction, the goal is to produce a mechanism that will reliably foretell the value of a target feature from the values of other features of a new, unseen instance. In contrast to description, prediction is concerned with induction and inference, and less emphasis is placed on providing a comprehensible description of the resulting model or method. In predictive tasks, accuracy of the predictions for new data instances is paramount. Black boxes that do not reveal the inner workings of the mechanism are quite acceptable as long as they provide accurate predictions of new data. For instance, neural networks are powerful but notoriously obscure predictors: they are universal approximators, but it is not always easy to decipher their underlying predictive mechanisms.
Explanation

For explanation, the goal is to recover the underlying mechanism by which the data are generated. Explanatory tasks aim at uncovering the causal relationships between variables. Variables that are highly correlated with the target variable are valuable as predictors, but they are not necessarily direct or indirect causes of the target variable. This distinction is crucial in some situations: while we can predict the value of a target variable from other closely correlated variables (for instance, an effect variable of the target), only causes of the target can influence its value. Causal pathways allow one to determine how a desirable outcome may be achieved by strategically manipulating its direct or indirect causes.
These three tasks are inter-related, but they differ in the characteristics required of the discovered knowledge and in the prime criteria for evaluating whether the results are satisfactory. In practice, we desire some of each kind of characteristic. Results that are extremely accurate but entirely incomprehensible, or vice versa, are deemed wanting. In particular, we tend to bestow causal interpretations on the discovered relationships, and provide, and sometimes invent, an explanation for the connections. These explanations may be warranted by background knowledge provided by subject matter experts, but sometimes they are contrived even when a causal reading is not warranted.

We also need to distinguish between making predictions from passively observing the characteristics of the data, and making predictions when external perturbations are introduced. In the former, (combinations of) features that are highly correlated with the target, for example, are valuable predictors. In the latter, when a variable is held or changed to a particular value through external intervention, the way this manipulation affects the target variable depends on the structural relationships between the variables. For instance, in an observational setting, observing a change in the value of an effect variable of the target will influence the prediction of the target value, whereas in a perturbation setting, a change in the value of an effect variable brought about by external means has no bearing on the value of the target variable. This is in contrast to when the change is imposed on a variable that is a cause of the target variable, in which case the target value will vary accordingly. In addition, associations between the target and other variables may be changed or broken by the perturbation. The difference between these two prediction settings is a crucial consideration when we aim to bring about a particular state of the system through the enactment of some intervention action or policy.

We will mostly concentrate on statistical prediction as related to automated knowledge discovery, but will touch on other aspects here and there.
3 MONKEYS AND TYPEWRITERS; BANGLADESHI BUTTER AND THE S&P 500
Friends of automated knowledge discovery advocate exploratory data analysis [Tukey, 1977]. Foes have another term for this activity: "data dredging" has often been used derogatorily to describe the unconstrained blanket exploration of data [Selvin and Stuart, 1966; Smith and Ebrahim, 2002, for example]. Chance associations and regularities are bound to turn up in the data if we are allowed to test for an unlimited number of patterns. Some of these patterns will be spurious, but if the analysis procedure has not been corrected for such effects, we could end up with dubious but seemingly valid "discoveries" that are not justified on the basis of the given data. This is akin to the inverse of the scenario of the monkey pounding on a typewriter: not only will the monkey, if it types long enough, probably produce the complete works of Shakespeare, but if one looks long and hard enough, intelligible patterns
can be found in almost anything, including ink blots and any collection of pages randomly generated by the monkey.

In a slightly less fantastical setting, a search was initiated in [Leinweber, 2007] for predictors of the S&P 500 index annual closing values over a ten-year period. Using regression analysis, an R² of 0.75 was obtained using a single predictor variable:

1: Butter production in Bangladesh.

The fit of the model was improved to R² = 0.95 by using instead the following two slightly extended variables:

1': Butter production in Bangladesh and the United States;
2: Cheese production in the United States.

A further increase of the fit to R² = 0.99 was achieved by adding a third variable into the mix:

3: Sheep production in Bangladesh and the United States.

Thus, to strike it rich it appears that we should pay great attention to dairy and farm products, especially those in Bangladesh. This case study was of course tongue-in-cheek:

The butter fit was the result of a lucky fishing expedition. The rest comes from throwing in a few other series that were uncorrelated to the first one. Pretty much anything would have worked, but we like sheep. [Leinweber, 2007]

The study went on to reveal that the regression model obtained was wildly inaccurate in predicting the stock index values outside of the period used for the initial analysis, even though it fitted the training data almost perfectly. In other words, the regression model obtained was purely, and highly, descriptive, but not predictive at all.

This case study illustrates amusingly how easy it is to leverage chance associations and overfit the data. Models so obtained are good descriptors of the given data and only the given data. They have little predictive power beyond the data to which they have been specifically fitted. In less outrageous cases, it might not be as easy to spot the incongruence, particularly when much of the data analysis is automated.
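To see concretely how little data it takes to manufacture such a fit, consider the following minimal sketch (Python, with synthetic data; the series and seed are hypothetical, not Leinweber's). It regresses ten yearly values of a random-walk "index" on batches of predictors that are unrelated to it by construction; with so few observations, the in-sample R² typically grows large by chance alone as predictors are added.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                  # ten annual observations, as above
y = np.cumsum(rng.normal(size=n))       # stand-in for an index series

def r_squared(X, y):
    # R^2 of an ordinary least-squares fit with an intercept term
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

for k in (2, 5, 8):
    X = rng.normal(size=(n, k))         # predictors independent of y
    print(k, "predictors: in-sample R^2 =", round(r_squared(X, y), 2))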
4 UNDERFITTING AND OVERFITTING
Models learned from the given data may underfit or overfit. These phenomena relate to the bias-variance dichotomy (together with the noise component) in a way that is not completely well understood, although there are general rules of thumb on the tradeoffs between the two.
4.1 Bias, Variance and Noise
The discrepancy between the value predicted for a variable and its actual value is attributable to three components: bias, variance, and noise. Bias refers to the average difference between the predicted value and the actual value. Variance refers to the variation arising from predictions obtained using different data samples drawn from the same population. Noise refers to the random variations inherent in the data. Bias and variance are related to the prediction model, whereas noise is a characteristic of the data. In regression analysis, the expected squared error of the estimated value can be decomposed additively into functions of the three components. For the prediction or classification of discrete variables, using for example a zero-one loss function instead of a squared loss function, the bias-variance-noise decomposition is not as neatly delineated. The decomposition of the error involves multiplicative terms whose interaction may explain the non-trivial behavior of some machine learning methods [Friedman, 1997; Domingos, 2000a; Domingos, 2000b].
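In symbols, one standard way to write the additive decomposition for squared-error loss (an added illustration, assuming the usual setup y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², where the hat denotes the learned predictor and the outer expectation is taken over training samples and noise) is

\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]}_{\text{variance}}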
The Bias-Variance Spectrum

The noise component is a property of the data and cannot be directly influenced by the learning method. The bias-variance tradeoff, however, can be adjusted to some extent to achieve different prediction behaviors. In general, models that underfit the data are of high bias, whereas models that overfit the data are of high variance. A decrease in one component tends to increase the other, although not necessarily by the same amount.

Neither underfitting nor overfitting is easy to detect on the basis of the data used to learn the model. However, underfitting is less of a problem in practice, since overfitting tends to increase the descriptive accuracy of the model, which seems desirable prima facie and is thus the more tempting pitfall. An increase in descriptive accuracy, however, does not necessarily translate into an increase in predictive accuracy, and in many cases leads to a decrease instead, as the variance in the sample data that has been minutely fitted by the learned model does not generalize to other samples or to the population from which the sample was drawn.

With enough finesse and models of sufficient degrees of freedom, we can always construct a model that describes the input data perfectly. For instance, suppose we use polynomial curve fitting to construct a predictive model from a set of sample data points with two variables. A linear model would underfit in many cases. The fit can be improved by increasing the order of the polynomial. In the extreme case, for n + 1 data points there always exists a polynomial of degree n that exactly fits all the given data points. Such a curve is perfectly descriptive of the data points at hand. However, if these data points are merely a random sample of a much larger noisy population, or if the true model is not really polynomial in the nominated independent variable, the degree-n polynomial is most likely overfitting and will predict poorly for data points outside of the given sample. This problem is magnified when the data sample is noisy and the perfect curve that passes through all the data points is fitting the noise rather than the signal.
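The polynomial example is easy to reproduce. The sketch below (hypothetical data: a sine signal plus noise) fits polynomials of increasing degree to eight sample points and compares the descriptive error on the fitted points with the predictive error on unseen inputs; the degree-7 interpolant fits the sample exactly but will typically predict far worse than the moderate fits.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy sample
x_new = np.linspace(0, 1, 101)             # unseen inputs
y_new = np.sin(2 * np.pi * x_new)          # true signal at those inputs

for degree in (1, 3, 7):                   # degree 7 interpolates all 8 points
    coeffs = np.polyfit(x, y, degree)
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)           # descriptive
    pred_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)  # predictive
    print(degree, round(fit_err, 4), round(pred_err, 4))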
4.2 The Reference Class
Finding the optimal balance along the bias-variance spectrum can be thought of as a variant of the reference class problem:

If we are asked to find the probability holding for an individual future event, we must first incorporate the case in a suitable reference class. An individual thing or event may be incorporated in many reference classes... [Reichenbach, 1949]

A single case is a member of many reference classes. A data point with n known binary attributes belongs to 2^n reference classes, for example. Each of these reference classes has its own distributional properties, many of which are not compatible with those derived from other reference classes. In addition, when these reference statistics are compiled from sample data, not all of them are available and not all of them are equally well supported by the data. The challenge then is to identify the appropriate reference class or classes among the competing classes, to arrive at a justifiable inference for the target case. This is analogous to finding the model encompassing the appropriate bias-variance split. Too broad a reference class and we will underfit; too narrow and we will overfit.

Reichenbach's recommendation for the reference class problem was to

...proceed by considering the narrowest reference class for which reliable statistics can be compiled. [Reichenbach, 1949]

whereas Salmon proposed that

...the single case should be referred to the broadest homogeneous reference class of which it is a member. [Salmon, 1967]

The two criteria are not as contradictory as they seem when not quoted out of context (guilty here!). The intuition is that among all the reference classes to which the single case in question belongs, the appropriate reference class should be as narrow as possible, but no narrower than what the accompanying statistics can demonstrate to be relevant. In other words, any narrower a reference class will be unsuitable as the basis of inference with respect to the target case. There are two ways a reference class can be too narrow to be suitable: a lack of statistical support and a lack of relevance.

It would appear that the more specific the reference class, the better matched it is to the target case in question, and thus that we should always prefer the narrowest reference class. However, as the reference class becomes narrower, the amount of
data belonging to this reference class becomes smaller. In the extreme case we may arrive at a reference class that matches the target case very closely but contains no known data points, leaving us in total ignorance with regard to the target prediction. The second way a reference class can be too narrow is by including features that are irrelevant to the target prediction. Again, as we add more irrelevant features, the corresponding reference class maps to fewer known data points, but the underlying probability distribution of the target property does not differ substantively from the distribution in the reference class without the irrelevant features. Thus, even if the narrower reference class still contains enough data points, the inclusion of irrelevant features effectively increases the variance of the prediction without increasing its accuracy. Reference classes are not always nested, and sometimes there is no single narrowest suitable reference class. The reference class formulation was further extended in [Kyburg, 1974; Kyburg and Teng, 2001].
4.3 Competing Hypotheses
The reference class problem surfaces in a number of different settings. In nonmonotonic reasoning, multiple inheritance leads to multiple extensions in, for example, default logic [Reiter, 1980]. Taxonomic hierarchies are used to resolve inheritance conflicts between rules that give competing conclusions [Etherington and Reiter, 1983; Touretzky, 1984].

In machine learning, a model or classifier is abstracted from a data set, and predictions about new data instances are made according to the model. Choosing among competing models may be considered a generalized reference class problem. Some classifiers can be thought of as algorithmic blueprints for picking out a reference class to be used for the prediction of each incoming new instance. For example, a decision tree [Quinlan, 1993] guides a new instance down successive branches of the tree according to specific features possessed by the instance. The path from the root to a leaf in the decision tree encapsulates the set of feature criteria that determine the appropriate reference class for the test instance.

Occam's razor is often invoked in the KDD literature to justify preferring simpler hypotheses to more complicated ones. For example, the pruning of decision trees aims at reducing the size of the tree by removing branches at the bottom that are deemed to be overfitting the data. Stepwise logistic regression and Markov blanket procedures have been developed to identify, from among all the input features, the features relevant to the prediction task at hand.

At first glance the Occam's razor principle appears to run counter to the directive to prefer the more specific reference class: a preference for simpler models translates into a preference for the less complicated, more general models that take fewer attributes into account. However, simplicity does not automatically translate into relevance. Selecting a simpler model is justified only when the irrelevant variables included in the more complicated models merely overfit the noise and sampling variance.
There have been theoretical arguments, and supporting experimental evidence, indicating that with the same training error the simpler model is not always the better one [Domingos, 1998, for example].
5 TESTING AND EVALUATION
Many data mining procedures involve administering statistical tests repeatedly. For example, in stepwise regression, variables are selected by their statistical significance in successive groupings. For decision trees, bottom leaves and branches are pruned iteratively according to a statistical criterion. In addition, hypotheses abstracted from the data need to be evaluated for predictive accuracy, and at times many hypotheses are tested together or serially in search of the “nugget” in data mining. We will examine below some issues pertaining to the statistical significance of a test result.
5.1 Hypothesis Testing
In hypothesis testing, typically a single, simple null hypothesis is tested against differences in all directions. The null hypothesis is rejected when the chance of false rejection is less than a specified level α. The value of the parameter α is meant to be calibrated with respect to the test setting. For example, the larger the sample size, the higher the power of the test and the more easily the null hypothesis is rejected; this can be counteracted by lowering the value of α. However, either because the guideline is vague or because the quantification is complex, in practice many studies adopt a standard parameter value of 0.05 or 0.01 for α. The use of these conventional values is seldom questioned, but adopting any value other than these two defaults necessitates a thorough defense of the choice, which may additionally discourage efforts to calibrate the parameter.

Multiple hypothesis testing further complicates the determination of a suitable value for the parameter α. Each hypothesis, independently tested at level α, stands a chance of α of being falsely rejected. Even if all the null hypotheses are true, at a level of 0.05 it is to be expected that 5 out of every 100 hypotheses tested would be falsely rejected simply by chance. The chance of obtaining at least one false rejection from multiple hypothesis tests is thus higher, and sometimes much higher, than the stipulated threshold α. For 100 independent hypotheses each tested at a level of 0.05, the probability of at least one false rejection is 99.4%: we can be practically certain of having committed the error. Multiple hypothesis testing relates to the problem of overfitting discussed in the previous section. Test enough hypotheses and sooner or later we are bound to hit upon one that by sheer chance fits the training data well.
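The 99.4% figure follows from a one-line calculation:

# Chance of at least one false rejection among n independent true null
# hypotheses, each tested at level alpha: 1 - (1 - alpha)^n.
alpha, n = 0.05, 100
print(round(1 - (1 - alpha) ** n, 3))   # 0.994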
5.2 Corrections and Approximations
Various methods can be adopted to more reasonably bound the overall chance of error in multiple testing. This is often accomplished by lowering the acceptable chance of error with which each individual hypothesis in a multiple testing setting can be rejected. Instead of utilizing an analytic bound, it is also common in machine learning to validate a hypothesis experimentally, with additional data not used for constructing the hypothesis in the first place.

Statistical Corrections

The level of significance parameter α can be corrected to reflect the testing of multiple hypotheses. One commonly used statistical correction is the Bonferroni method: for n hypotheses tested at the level α, each hypothesis is rejected only if its chance of error is less than α/n. The Bonferroni method is rather conservative. It provides a provably correct upper bound on the overall chance of falsely rejecting at least one hypothesis in the whole set of hypotheses considered, and this bound holds regardless of the relationships between the hypotheses, but in many cases it is much looser than the true error rate. In areas such as genomics, it is not uncommon to test thousands or even millions of hypotheses at a time. To achieve an overall bound of α on the chance of committing a false rejection error, according to the Bonferroni method the individual hypotheses would need to be tested at a level of a millionth of α, which may be all but unsatisfiable. Other procedures have been proposed which impose less stringent individual significance levels, for instance the Holm-Bonferroni method [Holm, 1979] and the False Discovery Rate method [Benjamini and Hochberg, 1995], with alternative characterizations of the control of the chance of error.

Experimental Methods

For computationally expensive error estimation procedures such as permutation methods, the chance of error can be approximated by experimental means. For example, Monte Carlo resampling can be used when the number of possible permutations is too large for exact computation: random permutations are repeatedly composed from the data, and the test statistics are compiled from these sampled permutations rather than from the whole enumerated set of permutations.
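As an illustration, here is a minimal sketch of a Monte Carlo permutation test for a difference in group means; the function name and the default number of resamples are illustrative choices, not a prescribed procedure.

import numpy as np

def perm_test_mean_diff(a, b, n_perm=10000, seed=None):
    # a, b: 1-d numpy arrays of observations from the two groups.
    # Approximates the permutation p-value by randomly reassigning group
    # labels instead of enumerating all possible permutations.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:len(a)].mean() - pooled[len(a):].mean()) >= observed
    return (hits + 1) / (n_perm + 1)    # add-one keeps the estimate above zero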
The predictive accuracy of a hypothesis learned from a data set can alternatively be evaluated by testing the hypothesis against unseen data. Ideally, the evaluation is performed on data collected in repeated trials, or a portion of the data in a single trial is reserved for validation purposes only. Resource constraints often do not allow either, in which case cross-validation procedures are engineered to extend the use of the available data. In an n-fold cross validation, the data set is partitioned into n (almost) equal parts. In each of n trials, a hypothesis is learned from n − 1 parts of the data and tested on the remaining part, yielding n results from such an arrangement. The individual trials are correlated to some extent, since the input data are reused among them, but the predictive error rate can be estimated from the cross validation test results.

Even though in each cross validation trial the training and testing data are separate, meta-level overfitting can occur when a large number of models are trained and tested on the supposedly unseen test sets. Again, perform enough trials and one of the models will fit the "unseen" test data well. This concern has led to an approach that splits the data into three parts: one for training, a second for testing, and a third for validation. Models vetted by cross validation are then evaluated against the still more unseen validation data set. While the use of a validation set further lowers the effects of data snooping, overfitting may still occur at the meta-meta level.
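For concreteness, a schematic sketch of the basic n-fold procedure described above; fit and predict are placeholders for whatever learner is being evaluated, and squared error stands in for the loss of interest.

import numpy as np

def cross_val_error(fit, predict, X, y, n_folds=10, seed=None):
    # fit(X, y) returns a model; predict(model, X) returns predictions.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)        # n (almost) equal parts
    errors = []
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])          # learn on n - 1 parts
        pred = predict(model, X[folds[k]])       # test on the held-out part
        errors.append(np.mean((pred - y[folds[k]]) ** 2))
    return np.mean(errors)                       # estimated predictive error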
5.3 Adjunction
The problem of multiple testing can be restated as the problem of adjunctive inference, as exemplified by the lottery paradox [Kyburg, 1961]. Briefly, the lottery paradox can be paraphrased as follows.[1] Consider a fair lottery with a million tickets. One of the tickets will be drawn as the winner. The chance for a particular ticket, for example ticket number 37841, to win is one in a million, a number so small that we can be practically certain that it will lose. The same line of reasoning can be applied to all one million tickets in the lottery. At the same time, one of these tickets is guaranteed to be drawn as the winner. These accepted statements taken conjunctively are inconsistent: every individual ticket will lose, yet one of the tickets will surely win. Worse, because of the symmetry it is not clear whether any one of these accepted statements should be rejected, even though they are jointly inconsistent. Removing any one of the statements will restore consistency; yet there is no one statement that seems less justified to be accepted than the others,[2] and any choice we make will seem arbitrary and irrational. Analogous problems arise qualitatively in the paradox of the preface [Makinson, 1965] and in default reasoning [Poole, 1989].

[1] The lottery may perhaps more properly be described as a raffle.
[2] Not counting the statement stating that one of the tickets will be the winning ticket, unless we also consider the possibility of fraud in the lottery setup.
The problem encountered in multiple testing follows the same structure. Each hypothesis considered by itself has an acceptable chance of false rejection, yet we can be practically certain that among a reasonably large repertoire of hypotheses at least one has been falsely rejected, even though there may not be any one particular hypothesis that can reasonably be singled out as the culprit.

The statistical corrections to the level of significance parameter α for multiple testing are a response to the failure of adjunction. When multiple conclusions are conjoined, the resulting compound conclusion is not warranted to the same degree as the individual conclusions. For the conjoined conclusions to be warranted at the original desired level α, the member conclusions need to be subjected to more stringent thresholds, for example α/n for n individual hypotheses according to the Bonferroni method.

Similar problems are encountered by models learned inductively from sample data. Each statistically sound predictive rule incurs a small, acceptable chance of error, but among a large number of predictions accumulated from repeatedly applying individual predictive rules, we can be practically certain that there is at least one false prediction. So far there does not appear to be a ready-made solution to this problem. Adjunction is a very persuasive form of inference. Rather than giving it up completely to avoid inconsistent beliefs, perhaps it is more pragmatic to correct the error bound whenever we can, and to be prepared to retract some of the conclusions in any case, when they turn out to be false even though they appeared to be warranted initially.
6 ASSUMPTIONS AND VIOLATIONS
Theoretical analysis provides bounds and optimality results based on certain assumptions about the nature and distribution of data. These assumptions are more often than not violated in practice. Although such violations should prompt us to rethink and re-qualify the assumptions where possible, they should not discourage us from applying the methods with suitable precautions and a fair understanding of the robustness of the methods.
6.1 Random Sampling
Given a set of data, a model is constructed and used to predict future or unseen events. For a model to be predictive, we need the future to be just like the past, and we need the data used to construct the model to be a representative sample of the parent population. This notion is usually captured as a random sampling assumption: the data observations are independent and identically distributed (i.i.d.) with respect to the parent population. Data collection is often opportunistic. Data are collected whenever and wherever they can be collected. They may consist of Psychology 101 students, or birds that were spotted on particular days, or households who have access to the internet (so they may participate in online polls). These convenience samples are hardly
i.i.d. samples. Not taking into account the provenance of the data and instead treating the data as an i.i.d. sample will distort the inferences drawn from the data.

The problem goes deeper than that. Even in the best of circumstances (other than artificial formulations), i.i.d. samples simply cannot be had. Consider the case of Psychology 101 students. The parent population is usually not the restrictive set of students in Psychology 101 classes, or all college students, or even the far more expansive "people here and now". The inferences drawn from data on the behavior of Psychology 101 students are in many cases intended to implicate the behavior of the all-inclusive set of human beings, geographically and temporally dispersed. Even if we could sample members independently and with equal probability from all corners of the world, future members of the human population cannot be available for i.i.d. sampling at the present time at any cost. Yet predictions are often concerned with future events (although we can also aim to predict retrospective or current events that are not included in the sample data). Future events are members of the parent population that are inaccessible for sampling, random or otherwise. The i.i.d. sampling requirement is thus at best an idealized requirement. Not only do we lack good reasons for considering a data sample to have been drawn i.i.d. from the entire parent population, but we usually have strong evidence against such a claim.
6.2 Representative Sampling
Our goal in sampling is not to obtain a random sample per se, but to obtain a sample representative of the parent population, which will allow us to infer from the characteristics of the sample to the characteristics of the population, based on the argument that the sample is "similar" to the rest of the population. Other than in the most trivial cases, it is not possible to establish the representativeness of a sample, as that would require knowing the composition of the parent population in the first place, in which case we would not need to go to the trouble of drawing a sample and performing an inference from the sample to the population.

A random sample is often used as a stand-in for a representative sample. Random sampling, however, is neither necessary nor sufficient to ensure a representative sample. A random sample might still be skewed, as must happen every now and then as a mathematical fact, and a non-random sample might have the appropriate proportion of elements, either by careful crafting of the sample or just by chance. Nonetheless, random sampling allows us to relate to the representativeness of a data sample in a mathematically rigorous way: the degree of similarity between the sample and the population can be quantified by the characteristic variations inherent in i.i.d. sampling.

Still, the i.i.d. sampling assumption is provably false in many situations. Instead of insisting on an untenable assumption (or effectively ignoring the implications
of accepting a false statement), we may consider reformulating the requirement as a weaker, though less crisp, requirement on the data sample. The data sample does not need to be provably random, or even highly probably random. Rather, we require the absence of strong evidence against the representativeness of the sample. While we cannot prove representativeness, where appropriate we can cast very reasonable doubt upon it, and sometimes even falsify it. With such a requirement, good sampling practice is then followed not to ensure the impossible i.i.d. sampling, but to serve as a safeguard against some obvious sampling biases. However, even if we adhere to all good sampling practice, if there are reasons to cast doubt on a sample's representativeness, the statistical inference from the sample to the population is justifiably undermined.
6.3 Nonmonotonic Inference
The above formulation is in the tradition of nonmonotonic reasoning. In default logic [Reiter, 1980], for example, a distinction is made between prerequisites, which need to be provably true, and justifications, which merely need to be plausible. A default rule is triggered if its prerequisite is provably true and its justifications are each not provably false. If one of the justifications is later found to be false, the default conclusion has to be retracted.

Random sampling, traditionally cast in the form of a default prerequisite, is not satisfiable. A more reasonable formulation would be to require as a prerequisite that the sample be drawn by following "good sampling practice". Good sampling practice does not ensure random or representative samples, but its protocols are relatively well defined and falsifiable, and adherence to such protocols can serve as positive evidence supporting the fitness of the sample.

The representativeness of a sample cannot be verified, and thus cannot take the form of a prerequisite. Rather, we can require as a default justification that we not know that the sample drawn is not representative. This requirement can be embodied in a number of conditions that should not be known to be false, even though we may not know them to be true either. For example, in seeking to determine the average IQ of people, if it comes to our attention that all the members of the sample, drawn carefully following good sampling practice, happen to be members of Mensa, we should not proceed with the inference. The inference would be blocked, as the default justification that the sample be representative has become suspect. Note that this can happen even with truly i.i.d. sampling, as samples comprising only Mensa members are as likely to be drawn as samples of any other particular composition (unless the sample is so large that there are fewer Mensa members than sample members). On the other hand, it is not necessary to establish positively that the sample is representative, a condition which is all but impossible to prove in most cases. With these prerequisites and justifications, the conclusion of the default rule is then the permissible statistical inference from the sample to the population.
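In Reiter's notation, a default rule has the form prerequisite : justifications / conclusion. The proposal just sketched might be rendered schematically as follows (the predicate names are illustrative, not from the text):

\frac{\mathit{GoodSamplingPractice}(s) \;:\; \mathit{Representative}(s)}{\mathit{InferFromSampleToPopulation}(s)}

The prerequisite must be established, the justification is merely checked for the absence of counterevidence, and the conclusion is withdrawn if the justification is later found to be false.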
Similarly, these conditions for statistical inference can be captured in a modal logic framework.
7 THE CASE OF ASSOCIATIONS IN ASSOCIATION RULES
One of the predominant subjects in data mining is the extraction of association rules, first introduced in [Agrawal et al., 1993]. A well-known application is the analysis of market basket data to discover customer buying patterns from register transactions. An association rule is a rule of the form X ⇒ Y, intended to capture dependencies between, for example, items that are likely to be purchased together in a single transaction. A well-known and well-cited, but unattributed, discovered rule is diapers ⇒ beer, which suggests, loosely, that among the customers surveyed, those who bought diapers also tended to buy beer at the same time. The interpretation of this association rule is left as an exercise for the reader.

More formally, the rule X ⇒ Y says that if event X occurs in the database, so does event Y, with a certain level of frequency as measured by two values, support and confidence:

support: the percentage, over all baskets in the database, of baskets in which X and Y appear together;

confidence: the percentage, over all baskets in the database that contain X, of baskets that also contain Y.

Each basket consists of a set of items from a given domain, for example a set of items purchased together in a single register transaction, among all items that are available for purchase at the store. The database is given as a set of baskets, that is, a set of sets of items. The objective of association rule mining is to find all the rules with support and confidence values above some user-defined thresholds.
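Computing the two measures is straightforward. A minimal sketch, with a hypothetical toy database of four baskets:

def support_confidence(baskets, x, y):
    # baskets: a list of sets of items; x, y: item sets for the rule X => Y.
    n_x = sum(1 for b in baskets if x <= b)
    n_xy = sum(1 for b in baskets if (x | y) <= b)
    support = n_xy / len(baskets)              # co-occurrence over all baskets
    confidence = n_xy / n_x if n_x else 0.0    # rate of Y among X-baskets
    return support, confidence

baskets = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
           {"milk"}, {"diapers"}]
print(support_confidence(baskets, {"diapers"}, {"beer"}))   # (0.5, 0.666...)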
7.1 Description
Association rules are sometimes advanced as rules of inference and used in a predictive setting. For example, rules concerning the associations found between items in a market basket analysis are supposed to reflect consumer behavior in general, and it has been suggested that the discovered rules can guide business decisions such as the running of store promotions and the placement of new products in the store [Agrawal et al., 1993].

Statistically speaking, this stance is problematic. Association rules are purely descriptive: they present a summary of the relationships between the attributes, as manifested in the instances in the existing data set. A rule X ⇒ Y with, for example, 1% support and 90% confidence indicates that at least 90% of the X's found in the given data set are also Y's, and that X's and Y's co-occur in the data set at least 1% of the time. These rules, however, say nothing conclusive about the relationships between attributes in the instances that are not part of the given
data set. They merely report verbatim patterns found in the static data set at hand.

Mining association rules is a clearly defined task. The objective is to generate all rules of the form X ⇒ Y which score above some given support and confidence thresholds. The problem of evaluation and validation is thus reduced to one of correctness and efficiency. Correctness in this case is unambiguous: any algorithm is required to return exactly the set of rules meeting the given support and confidence criteria. Since there is no difference between the set of rules returned by one algorithm and the next in the standard association rule framework, much of the research effort in this area has understandably focused on efficiency, in terms of time and storage. High performance association rule mining aims at overcoming the challenges imposed by the tremendous size of the data sets involved and the potential number of rules that satisfy the mining criteria [Savasere et al., 1995; Agrawal et al., 1996, for example]. Another line of research in this area seeks to find, from among the complete set of association rules, a subset of interesting, novel, surprising, anomalous or otherwise more noteworthy rules [Bayardo and Agrawal, 1999; Klemettinen et al., 1999, for example]. Still, this amounts to selecting a subset of descriptive rules, which may be smaller but has no more predictive power than the full set of rules.
7.2 Prediction
In prediction, the goal is to make inferences about the characteristics of instances we have not seen, instances that are outside of the original data set, either because they have not been sampled for our data set or because they lie in the future. We would like the rules we derive from the given data set to be indicative of the patterns that can be found in a much larger domain. The target domain may be the set of all potential transactions, for example, including both those performed by current customers who did not shop on the day we collected the data, and those that will be performed by future customers who have not shopped here yet but one day soon will. (We note also that the data collected are usually a convenience sample, with all its attendant problems. For example, the data set may consist of all transactions that occurred on a particular day in a particular store, among all days and all stores in a chain.) The target domain thus is not only much larger than the given data set, but is also typically of infinite cardinality, extending into the future. Examining the whole population is out of the question in many cases of interest, both because of the size of the population and because many members of the population, such as those that exist in the future, cannot be available for examination here and now.

In the association rule framework, the minimum support requirement is expressed as a relative proportion of the given data set. This does not make statistical sense in the predictive setting, where the data set is only a sample of our target population, which in addition may be infinite. Rather, statistical support is
based upon the absolute number of instances that satisfy a certain condition, for example the antecedent of a proposed rule. Furthermore, this support varies from rule to rule, according to the content and the absolute number of occurrences of the rule antecedent. This is in contrast to the support value for standard association rules, which is always specified relative to the size of the whole data set. This is not a matter of tweaking the support and confidence thresholds to arrive at a different set of association rules. The problem lies in distinguishing between a rule that is justifiably acceptable and one whose supporting evidence is inconclusive. It is not the case that "[w]hile confidence is a measure of the rule's strength, support corresponds to statistical significance" [Agrawal et al., 1993]. The support criterion as construed in the standard framework is inapplicable for statistical inference.

In practice, however, the standard approach has been widely used in many situations, for both description and inference, with little appreciable difficulty. Depending on the utility and the sensitivity of the application, the descriptive support and confidence of the discovered rules may suffice as approximate indicators of their predictive capability in the larger intended population. This is especially true when the sample size in question is large, which is bolstered in the standard association rule framework by the combined effect of huge data sets and a reasonably high support threshold.
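To illustrate why absolute counts, rather than relative support, carry the statistical information, consider interval estimates of a rule's true confidence. The sketch below uses the Wilson score interval as one standard choice; the counts are hypothetical. The same 90% observed confidence is conclusive or inconclusive depending on the absolute number of supporting instances.

import math

def wilson_interval(n_xy, n_x, z=1.96):
    # Approximate 95% interval for the rule's true confidence, given the
    # absolute counts n_xy (baskets with X and Y) and n_x (baskets with X).
    p = n_xy / n_x
    denom = 1 + z**2 / n_x
    center = (p + z**2 / (2 * n_x)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n_x + z**2 / (4 * n_x**2))
    return center - half, center + half

print(wilson_interval(9, 10))       # wide: little evidence behind the 90%
print(wilson_interval(900, 1000))   # narrow: strong evidence behind the 90%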
8 REPLACING STATISTICIANS WITH COMPUTERS
Traditionally, experienced statisticians and data analysts have been the custodians of the etiquette of statistical analysis. We rely on expert judgement and common sense to ensure that analysis protocols and inference procedures are properly executed, and to detect potential methodological and experimental trouble spots. Common sense stops us from deriving the eating habits of the general public by polling people in a Buddhist temple, or from using the butter production in Bangladesh to predict the S&P 500. Experts can in many cases separate spurious associations from relevant ones and extract meaningful patterns with insight into the data. We may consider replacing statisticians with computers in automated data analysis and knowledge discovery. There are pros and cons to such an arrangement, as discussed below.
8.1 The Flip Side
The voluminous amount of data available for analysis these days poses a cognitive barrier to applying human judgement. Analysis is increasingly carried out by automated or semi-automated processes, and the human analyst is increasingly removed from the nitty-gritty of the data. None of the issues discussed above are very new, but with automated data analysis and knowledge discovery these issues
have come to the forefront, as computers for the most part lack both common sense and expert judgement. There are efforts to formalize common sense and background knowledge, for example the Cyc project [Lenat, 1995], as well as efforts to develop objective and quantifiable measures of intuitive but non-trivial concepts, such as which knowledge discoveries are considered "interesting" or "surprising" [Silberschatz and Tuzhilin, 1996; Chakrabarti et al., 1998, for example]. It is hoped that computerized background knowledge can substitute for human common sense and expert judgement to some extent, or at least help cut down the search space of the analysis task.
8.2 The Upside
Even though the state of the art is still far from capturing human insight and judgement, and a considerable amount of background knowledge would be needed for general reasoning and inference, in the near term usable knowledge bases can be constructed for relatively restricted domains. For scientific inference and knowledge discovery, at least initially we typically do not need to worry about emotions, metaphorical implications, and the like. The analysis of scientific data is considerably simplified when it is confined to a specialized and well defined domain. On the upside, the use of computers to automate at least part of the data analysis process has made feasible a large repertoire of methods that were too computationally intensive when calculations had to be done entirely by hand. For example, factor analysis was preferred to tetrad methods because tetrad constraints are difficult to compute manually even for problems of moderate size [Guilford, 1954]. In addition, complex, high dimensional problems, frequently encountered in for example bioinformatics, do not easily yield an analytic solution, but they can be tackled numerically by Monte Carlo methods assisted by computers.
8.3 On Balance
Data are often too voluminous or too complex to yield to unaided human insight. The automation of knowledge discovery allows us to discover more patterns more rapidly than ever. Computers, however, do not possess the expert judgement and common sense needed to choose the appropriate analysis methods and to interpret the significance and relevance of the findings. Without expert oversight, bizarre and nonsensical findings from the institutionalized process can easily pass as sober findings. To realize the potential of automated knowledge discovery, we need to install explicit safeguards against the misuse of methods and data. The data characteristics and conditions necessary for an analysis method need to be articulated formally, such that the analyst can be alerted when assumptions are violated, data are suspect, and matters are not as they should be.

Analysis methods developed for application under one set of conditions are routinely used even when the assumptions are violated. For instance, data
distributions are often not Gaussian, yet normality is often assumed by the methods applied. Similarly, the construction of naive Bayes classifiers depends on independence between attributes, which is often patently false. We need to understand how robust an analysis method is: how much deviation from the assumed conditions can be tolerated, and what the magnitude and nature of the effects of such deviations are.

On the meta-level, there is the challenge of learning how to learn. Expert analysts select the methods to apply based on the problem and data at hand. This process may involve multiple steps and different methods at each step. Automated knowledge discovery systems could learn this selection from a repository of data and problem characteristics together with the analysis methods applied, provided that these meta-data are properly articulated and formalized. The issues discussed here and elsewhere will be of paramount relevance at this level as well.
BIBLIOGRAPHY

[Agrawal et al., 1993] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on the Management of Data, pages 207–216, 1993.
[Agrawal et al., 1996] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery of association rules. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, Menlo Park, CA, 1996.
[Bayardo and Agrawal, 1999] R. Bayardo and R. Agrawal. Mining the most interesting rules. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 145–154, 1999.
[Benjamini and Hochberg, 1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.
[Breiman, 2001] Leo Breiman. Statistical modeling: The two cultures (with comments and rejoinder). Statistical Science, 16(3):199–231, 2001.
[Chakrabarti et al., 1998] Soumen Chakrabarti, Sunita Sarawagi, and Byron Dom. Mining surprising patterns using temporal description length. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 606–617. Morgan Kaufmann, 1998.
[Domingos, 1998] Pedro Domingos. Occam's two razors: The sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 37–43, 1998.
[Domingos, 2000a] Pedro Domingos. A unified bias-variance decomposition and its applications. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 231–238, 2000.
[Domingos, 2000b] Pedro Domingos. A unified bias-variance decomposition for zero-one and squared loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 564–569, 2000.
[Elder and Pregibon, 1996] John F. Elder, IV and Daryl Pregibon. A statistical perspective on knowledge discovery in databases. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 83–113. AAAI Press, 1996.
[Etherington and Reiter, 1983] David W. Etherington and Raymond Reiter. On inheritance hierarchies with exceptions. In Proceedings of the Third National Conference on Artificial Intelligence, pages 104–108, 1983.
[Fayyad et al., 1996] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI Press, 1996.
[Friedman, 1997] Jerome H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55–77, 1997.
[GenBank, 2010] National Center for Biotechnology Information. NCBI-GenBank Flat File Release 179.0. Distribution Release Notes, 2010.
[Glymour et al., 1997] Clark Glymour, David Madigan, Daryl Pregibon, and Padhraic Smyth. Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1:11–28, 1997.
[Guilford, 1954] J. P. Guilford. Psychometric Methods. McGraw-Hill, second edition, 1954.
[Hand, 2000] David J. Hand. Methodological issues in data mining. In Proceedings in Computational Statistics (COMPSTAT), pages 77–85, 2000.
[Holm, 1979] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.
[Klemettinen et al., 1999] Mika Klemettinen, Heikki Mannila, and A. Inkeri Verkamo. Association rule selection in a data mining environment. In Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, pages 372–377, 1999.
[Kyburg and Teng, 2001] Henry E. Kyburg, Jr. and Choh Man Teng. Uncertain Inference. Cambridge University Press, 2001.
[Kyburg, 1961] Henry E. Kyburg, Jr. Probability and the Logic of Rational Belief. Wesleyan University Press, 1961.
[Kyburg, 1974] Henry E. Kyburg, Jr. The Logical Foundations of Statistical Inference. Reidel, Dordrecht, 1974.
[Leinweber, 2007] David J. Leinweber. Stupid data miner tricks: Overfitting the S&P 500. Journal of Investing, 16(1):15–22, 2007.
[Lenat, 1995] Douglas B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[Makinson, 1965] David Makinson. The paradox of the preface. Analysis, 25:205–207, 1965.
[MODIS, 1999] National Aeronautics and Space Administration, Goddard Space Flight Center. TERRA: The Earth Observing System (EOS) AM-1, 1999.
[Poole, 1989] David Poole. What the lottery paradox tells us about default reasoning. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 333–340, 1989.
[Quinlan, 1993] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[Reichenbach, 1949] Hans Reichenbach. The Theory of Probability. An Inquiry into the Logical and Mathematical Foundations of the Calculus of Probability. University of California Press, second edition, 1949.
[Reiter, 1980] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13:81–132, 1980.
[Salmon, 1967] Wesley C. Salmon. The Foundations of Scientific Inference. University of Pittsburgh Press, 1967.
[Savasere et al., 1995] Ashoka Savasere, Edward Omiecinski, and Shamkant Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st Conference on Very Large Databases, pages 432–444, 1995.
[Selvin and Stuart, 1966] Hanan C. Selvin and Alan Stuart. Data-dredging procedures in survey analysis. The American Statistician, 20(3):20–23, 1966.
[Silberschatz and Tuzhilin, 1996] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6):970–974, 1996.
[Smith and Ebrahim, 2002] George Davey Smith and Shah Ebrahim. Data dredging, bias, or confounding. British Medical Journal, 325:1437–1438, 2002.
[Touretzky, 1984] David S. Touretzky. Implicit ordering of defaults in inheritance systems. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 322–325, 1984.
[Tukey, 1977] John W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
Part XIII
An Application of Statistics to Climate Change
AN APPLICATION OF STATISTICS IN CLIMATE CHANGE: DETECTION OF NONLINEAR CHANGES IN A STREAMFLOW TIMING MEASURE IN THE COLUMBIA AND MISSOURI HEADWATERS

Mark C. Greenwood, Joel Harper and Johnnie Moore

Recent analyses of regional snow and river data for the late 20th century have suggested that western North American snow pack has decreased [Mote et al., 2003] and that the spring snow melt pulse has arrived earlier [Stewart et al., 2005]. These papers investigate linear change over this time period, allowing the results to differ across locations by using distinct linear regression models at each location. Moore et al. [2007] found less conclusive evidence for linear change when adjustments for multiple testing were considered, but used similar statistical models otherwise. Here, a class of nonparametric spatial-temporal models for the same data used in Moore et al. [2007] is used to estimate a common regional-level trend while accounting for spatial and temporal dependencies, allowing evidence for linear and nonlinear regional-level trends to be compared. We find evidence for a nonlinear regional-level trend between 1951 and 2005, generally demonstrating earlier arrival over the study period as well as a more complex pattern of change than previously reported.
1 INTRODUCTION
Detecting the impact of climate change in natural systems is typically hampered by short time series, especially for instrumental records. In some systems, high levels of natural variability make the task particularly difficult due to low signal to noise ratios. To obtain more precise estimates, information can be aggregated across many locations to provide an average change in the system. By borrowing information across locations we can often find a less variable estimate of the average. When information is aggregated across locations, correlations between locations need to be considered to properly estimate the precision of the estimate. Many researchers trying to assess change have assumed that the change is linear over time. In some natural systems over moderate time scales that may be reasonable, but over long enough time frames that assumption certainly becomes problematic. The models used in this work
incorporate correlations over space and time and compare models that assume linearity to those that estimate trends nonlinearly using a nonparametric regression technique. We propose a method to assess the evidence for nonlinear trends in the timing of streamflow in the headwaters of the Missouri and Columbia Rivers. Streamflow timing refers to the arrival date of characteristics of the yearly distribution of streamflow, most often the mean or median of the flow. We consider the day of the median, when half the yearly flow has occurred, and look for changes in that timing from 1951 to 2005. Streamflow timing in gage locations such as ours, which are at high elevation and not disturbed by irrigation practices, should primarily record climate change induced trends in snowmelt. The conjecture is that earlier timing measures should be observed over the study period due to warmer temperatures causing earlier snowmelt. The details of this trend and the statistical evidence regarding it are the main concerns of this work, which extends Moore et al. [2007] into nonparametric spatial-temporal models.
2 PREVIOUS RESEARCH
Estimating the amount of change that has occurred due to global warming is a challenging problem due to relatively short physical records and relatively high natural variability in the systems. One approach that has often been employed is to estimate the amount of change at each location and test for statistical evidence for the presence of that change. This provides many localized estimates of change and can lead to problems with compounding of statistical error from using many hypothesis tests. Numerous researchers have searched for evidence of climate change in the Western US, focusing on different measures that might record that change. One area of this research involves analyzing streamflow discharge at high elevation stream gages, searching for changes in the shape or timing of the discharge over time. Stewart et al. [2005] and Moore et al. [2007] consider different streamflow metrics and use simple linear regression to assess evidence of a linear change over the last half century in Western U.S. rivers. The results of Stewart et al. [2005] are often cited as evidence of earlier snowmelt in this region. Moore et al. [2007] also found some evidence of earlier streamflow timing but found less convincing evidence of that change after adjusting for the number of tests considered. There are two issues with using site specific analysis techniques: one is the number of hypothesis tests to consider, and the other is the difficulty in synthesizing those results to provide a regional level result. In analyzing streamflow timing measures for evidence of climate change there are a number of issues to consider in the choice of analysis technique, especially related to the site specific analyses. The first is the choice of metric used to measure timing; the second is the number of hypothesis tests performed (which scales with the number of locations and response variables considered); the third is whether linearity is a reasonable assumption for the trends; and the fourth is whether local analyses can provide the overall inference that is really of interest. These
issues are considered in order in this work, along with solutions to each. First, there is the question of the timing measure(s) to use. Stewart et al. [2005] use a peak detection algorithm and what they describe as the center of mass of timing (CT). CT is defined as CT = Σi ti qi / Σi qi, where ti is the day of the water year for i = 1, . . . , 365 and qi is the daily discharge. If we let totalQ = Σi qi be the total annual discharge, CT can be re-written to show that it is the average of the day of the water year weighted by the proportion of the annual flow on that day (pi) as follows: CT = Σi ti qi / totalQ = Σi ti (qi / totalQ) = Σi ti pi. This provides a day of the average flow for the water year. Hydrographs (plots of daily flow) are typically defined by the water year, starting Oct. 1, which is typically during the lower flow season in the areas in this study. Flows increase rapidly when the snow begins to melt in late winter and then flows taper off as the snow is exhausted and precipitation is low in the later portion of the summer. This tends to result in a highly skewed distribution, as displayed in Figure 1, due to the choice of the starting date in the water year. It is well known that the mean, to retain its center of mass interpretation, moves in the direction of the long tail of the distribution (see, for example, [DeVeaux et al., 2008]). The median is a better measure of the center in skewed distributions. We denote the day of the median as D50, calculated as the day satisfying Σi=1..D50 pi = 0.5. Additionally, outliers have more impact on the mean than on the median. Outliers could easily be generated in the low flow portions of the year due to precipitation events, dragging the mean in the direction of those unusual events. A final note on CT relates to the arbitrariness of the beginning of the water year. If, for some reason, this were to be re-defined, then the CT could be dramatically impacted, whereas the median, which we use here, would be more robust to these changes. To illustrate this, the flow records from USGS Gage 13186000, South Fork of the Boise River, near Featherville, ID, at an elevation of 4218 ft, are shown in Figure 1. The plot contains the daily hydrographs and cumulative discharge functions [Greenwood et al., 2007] for 1951 and 2005 at this location. The CT for 1951 was day 218.1 and D50 was 229; for 2005 the CT was 215 and D50 was 232. The sensitivity of the mean to the length of the tail of the distribution is demonstrated by the difference between the measures of center in these two hydrographs. Inspection of the two hydrographs in Figure 1 suggests the difficulty of using the date of the first large spring peak. It is especially difficult to visually choose a "peak" in the 2005 hydrograph that represents spring onset. For this reason, we focus on measures of the center of discharge. Moore et al. [2007] explore some other reasons for preferring the median over the timing of peak discharge or CT, including the stronger relationship between total yearly discharge and CT than between total yearly discharge and the day of the median. The median seems to record more information that is independent of the total yearly discharge than CT and does not suffer the other drawbacks noted above; we will use the D50 in this analysis. The second issue with site specific analyses involves the repeated use of hypothesis tests, where each test is subject to errors.
Figure 1. Gage on the South Fork of the Boise River, 1951 and 2005. Panels i and ii contain hydrographs, iii and iv contain cumulative yearly discharge, with CT displayed as the bold dashed line and D50 as the light dashed line.
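To make the two timing measures concrete, here is a minimal Python sketch (ours, not from the chapter) that computes CT and D50 from one water year's vector of daily discharges; the function name and the use of NumPy are illustrative assumptions.

    import numpy as np

    def timing_measures(q):
        """Compute CT and D50 from daily discharges q for one water year.
        Days are numbered 1..len(q) within the water year."""
        q = np.asarray(q, dtype=float)
        t = np.arange(1, len(q) + 1)   # day of the water year
        p = q / q.sum()                # proportion of annual flow per day
        ct = np.sum(t * p)             # center of mass of timing
        # first day at which the cumulative proportion of flow reaches 0.5
        d50 = t[np.searchsorted(np.cumsum(p), 0.5)]
        return ct, d50

For a snowmelt-dominated hydrograph with a long summer recession, D50 stays near the melt pulse while CT is pulled toward the long tail, which is the sensitivity the text describes.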
Hypothesis testing can be criticized merely for its inability to assess practical importance [Armstrong, 2007], for the arbitrariness of the choice of significance level, for the impacts of the choice of direction in tests, and because repeated tests lead to an increased probability of spuriously detected significant results. For example, Stewart et al. [2005] consider 294 different tests for significant linear trends in CT. The tests used are 2-sided, meaning that they attempt to detect change either to earlier or later timing, which is a conservative choice if researchers are only looking for earlier timing and that is all that is present. However, it is liberal in that it allows for significant results in either direction, with significant results of later timing providing a strong contradiction to warming driven earlier timing. Additionally, Stewart et al. [2005] use a 10% significance level for most of their tests, which is generally considered a liberal choice, except that, as the previous discussion suggests, these 10% 2-sided tests are equivalent to 5% 1-sided tests if all the resulting estimates are in the direction
of earlier timing. Of their 294 tests, 105 are reported to be significant based on simple linear regression. This is a higher rate of significant results than would be expected just based on chance, which is 29.4 based on their assumed error rate, the number of tests, and assuming that there is no trend present in any location. The problem is then to consider which 30 of the 105 significant results might have been due to chance alone. More generally in the same paper, around 3200 hypothesis tests were performed (the exact number of tests and significant results found is unclear), which is around 10 tests per study site. It is possible to control the overall error rate in testing by using an adjustment for multiple testing, leading to increased belief in the locations where significant changes are detected. The most well known adjustment for multiple testing is the Bonferroni method [1936], but it is known to be a conservative approach, and with thousands of tests it makes it nearly impossible to find any significant results. Newer approaches from the biostatistics literature would seem to be more useful here, such as Benjamini and Hochberg [1995] and Storey and Tibshirani's q-value [2003]. These methods are less conservative, providing higher power to detect changes, while still providing useful protection for multiple testing by controlling other measures of error instead of the experiment-wise error rates that Bonferroni controls. Moore et al. [2007] used the method of Benjamini and Hochberg [1995] and found significant linear trends for D50 for 8 of 21 locations after adjusting for the 126 tests considered in that paper (6 tests per study site). It is difficult to aggregate a set of significant and nonsignificant tests across locations in a region to quantify the overall magnitude of change in the region; the judgement about the magnitude of change is often non-statistical. A further concern in climate change detection involves the assumption of linearity of trends. A more detailed discussion of different methods for trend detection is given below, but this concerns the assumption that climate change can and should be measured with linear or even monotonic assessments. While warming of around 0.2 degrees C per decade has been reported in Hansen et al. [2006], that paper highlights the nonlinearity of the global temperature trend by using 5-year running means to describe the trend in observed temperatures. This result alone suggests that linear assessments of climate change are crude approximations to the actual signals that temperature changes might generate, and that extending linear trend detection to longer time scales can produce un-representative trend estimates. Over shorter time scales, more interesting local variation will be ignored if all trends are assumed to be linear. The best arguments for linear trend detection methods are that many curves can be reasonably approximated by straight lines locally, that the results are simple to interpret, that the statistical tests are well known, and that the linear trend could be a component of a more complicated polynomial when the only interest is in that component of the model. Other trend tests are based on monotonic trend detection, but these can also be problematic, as local decreases may be found in the presence of overall increases and some methods would fail to detect the overall change. Additionally, these methods typically do not provide estimates of the amount of change, just whether there is evidence of its presence.
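To illustrate the less conservative style of adjustment, the following Python sketch implements the Benjamini-Hochberg step-up procedure for a vector of p-values; the function name and the use of NumPy are our own, not the authors'.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        """Step-up FDR procedure of Benjamini and Hochberg [1995].
        Rejects the k smallest p-values, where k is the largest i with
        p_(i) <= (i/m)*q; returns a boolean rejection mask."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        below = p[order] <= q * np.arange(1, m + 1) / m
        k = (np.nonzero(below)[0].max() + 1) if below.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject

By contrast, a Bonferroni correction compares every p-value to q/m, which with m in the thousands leaves essentially no power, as the text notes.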
We take a different approach, comparing support for models that contain linear or nonlinear components, and then using the most supported model(s) to describe the form and magnitude of any trends. In some, but certainly not all, climate change detection papers, the measurements were assumed to be independent over time. This is always the case when Pearson correlation and simple linear regression tests are used. When the autocorrelation is negligible, those procedures are not impacted, and when the autocorrelation is negative, the procedures are actually conservative [Cryer and Chan, 2008]. The latter situation was encountered in Moore et al. [2007], so simple linear regression was used without adjustment there. In general, it is dangerous to use statistical procedures that assume independent measurements on time series without at least assessing the magnitude of the dependency. Similar concerns arise when attempting to aggregate effects spatially. All models considered here incorporate spatial-temporal correlations but tend to demonstrate minimal dependencies on either aspect for D50. A major benefit of attempting to estimate a regional level trend is that the display of the trend can be simplified into a single plot, as opposed to plots of significant trends by location. For example, in Stewart et al. [2005], the reader is left to visually assess the sign and magnitude of trend tests and somehow visually aggregate the results. Based on the many trends explored in that paper, the Intergovernmental Panel on Climate Change [Lemke et al., 2007] summarizes the observed trends as:

Date of maximum mountain SWE [snow water equivalent of the mountain snowpack] appears to have shifted earlier by about two weeks since 1950, as inferred from streamflow measurements [Stewart et al., 2005]. That these reductions are predominantly due to warming is shown by regression analysis of streamflow. [Stewart et al., 2005]

This summary result of about a 2 week shift is not accompanied by any measure of precision and is not really a statistical estimate, as it is based on a non-statistical aggregation of many observed slope coefficients, some that are significant and some that are not. In Moore et al. [2007], confidence intervals for each question of interest are displayed, which allows assessment of the practical significance of the change at each location but fails to provide a direct regional level generalization. While it is more difficult statistically to estimate regional level trends, it is simpler to then interpret the results as the average change in the region.
3 DATA
We use, as in [Moore et al., 2007], a set of 21 high elevation streamflow gages. These gages were selected after considerable screening to eliminate locations that have been impacted by irrigation withdrawal and small dams. They are a small subset of the gages used in [Stewart et al., 2005] that are in the Columbia and Missouri Headwaters. A map of the gage locations is provided in Figure 2.
These gages should provide records that are unaffected by human modification and that primarily record snowmelt signals due to their high elevation locations, leading to an analysis focused on changes in streamflow timing that should correspond to snowmelt.
Figure 2. Map of gage locations, with black dots indicating the locations used.

Streamflow timing measures are driven by precipitation, temperature, solar radiation, and basin specific characteristics. Basin specific characteristics are incorporated into our models by allowing for different average timing for each gage through a flexible spatial intercept in the models explained below. Weather is both a local and regional phenomenon. On a regional scale, differences in weather between years can be explained, to some degree, using regional climate forcing functions such as the El Niño/Southern Oscillation (ENSO, [Wolter et al., 1993; 1998]) and the Pacific Decadal Oscillation (PDO, [Zhang et al., 1997]). ENSO has a period of around 7 years, with positive ENSO values (El Niño winters) related to drier winters and warmer springs in the Northwest U.S. This should lead to earlier spring runoff in this region. Positive PDO values are also suggested to lead to drier winters and warmer springs in the Northwest U.S., which should also be associated with earlier spring. The period for PDO is thought to be around 20 to 30 years. It has also been suggested that positive values of both PDO and ENSO in the same year can have a synergistic effect. To address the potential importance of these
two variables, we incorporate yearly averages of each measure individually and together in the models, as well as considering an interaction between them to assess that potential synergistic effect. Figure 3 displays both functions with estimated trend lines for easier visualization of the periodicity and trend in each measure. PDO is more correlated with year than ENSO (r = 0.51 vs. 0.36) and the measures are moderately correlated with each other over this time period (r = 0.72). Total yearly discharge (totalQ) at a gage is a function of the precipitation in the basin. Moore et al. [2007] demonstrated the impact of larger total yearly discharge on delaying timing measures, both theoretically and using actual measurements. The variability in timing measures was shown to be usefully explained using totalQ, but visual diagnostics also suggested that this effect may be nonlinear, with upper and lower thresholds within each gage for the effects of changing amounts of totalQ. These values are transformed first by detrending the results for each gage to remove any collinearity with year. This allows the time trends estimated in the models to contain all the linear change over time. To allow for models that incorporate total discharge across the study area in a single effect, the detrended total discharge is also standardized for each gage. Any further references to totalQ are to this detrended, standardized measure. These transformations had minimal impacts on the results but make the trend component in the models interpretable as the change over time, even though this makes the totalQ effect more difficult to interpret. Figure 4 displays D50 over time and versus the detrended, standardized totalQ for two gage locations. The modest negative linear time trend is visible along with the indication of a nonlinear relationship between totalQ and D50.
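A minimal sketch of this transformation for a single gage, in Python with NumPy (the names are ours and the details of the authors' procedure may differ):

    import numpy as np

    def detrend_standardize(total_q, years):
        """Remove the gage's linear trend in totalQ over year, then
        standardize the residuals, as described in the text."""
        years = np.asarray(years, dtype=float)
        X = np.column_stack([np.ones(len(years)), years])
        beta, *_ = np.linalg.lstsq(X, total_q, rcond=None)  # OLS on year
        resid = total_q - X @ beta                          # detrended totalQ
        return (resid - resid.mean()) / resid.std(ddof=1)   # standardized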
4 TREND DETECTION METHODS
There are a wide variety of techniques that have been developed for detecting trends over time. The simplest and most commonly used involves estimating a linear trend using simple linear regression, which provides a single estimate of the amount of linear change over time, and often a hypothesis test is used to assess the evidence against a slope of 0. This technique carries a wide variety of assumptions, including a constant linear relationship between the measure of interest and time, and errors that are independent, normally distributed, and of constant variance over time (homoskedasticity). Many different approaches are available to retain its simplicity while making the technique more robust to violations of some or all of those assumptions. The linearity assumption can be addressed via nonparametric trend tests such as the Cox-Stuart [1955] trend test or Mann's [1945] trend test (also called the Mann-Kendall test), both assessing the patterns of increases or decreases and how unusual those patterns might be as evidence of a trend. These methods are robust to the linearity assumption and, to varying degrees, to the other assumptions, although the robustness of these techniques to heteroskedasticity is less clear. These methods do not directly lead to estimates of the amount of change or the potential to consider trends adjusted for other effects.
Figure 3. Plots of Water Year averaged PDO [Zhang et al., 1997] and ENSO [Wolter et al., 1993; 1998] effects.
The Mann-Kendall procedure has been adapted to consider regional level effects based on multiple sites for change detection in Renard et al. [2008]. Another alternative involves a resistant slope estimate such as in Theil-Sen regression [Theil, 1950; Sen, 1968], which provides resistance to the effect of some unusual observations. Fully robust techniques based on Huber's M-estimation [Huber, 1981] provide robustness to the normality assumption by using estimation techniques not based on least squares as in typical regression models. These robust regression models, like the Theil-Sen resistant methods, provide slope coefficients that have relatively standard interpretations, avoid least squares estimation (which corresponds to the normality assumption), and can incorporate other variables in the models. Houseman [2005] has proposed a version of robust regression that incorporates an estimate of autocorrelation between responses, providing a technique that is robust to violations of the independence and normality assumptions. While these techniques are robust to unusual observations, they do not directly address non-constant variance, which may be present in climate change situations where the variability is also changing over time.
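For concreteness, here is a minimal Python sketch of the Theil-Sen slope (the median of all pairwise slopes); this is a generic illustration, not the authors' code:

    import numpy as np

    def theil_sen_slope(t, y):
        """Median of pairwise slopes (y[j]-y[i])/(t[j]-t[i]) for i < j;
        resistant to a substantial fraction of unusual observations."""
        t = np.asarray(t, dtype=float)
        y = np.asarray(y, dtype=float)
        slopes = [(y[j] - y[i]) / (t[j] - t[i])
                  for i in range(len(t)) for j in range(i + 1, len(t))
                  if t[j] != t[i]]
        return np.median(slopes)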
Figure 4. Linear regression effects for two gages (SF Boise River and Lochsa River): day of median flow (D50) plotted against year and against the detrended, standardized totalQ.
Generalized least squares techniques (described in [Pinheiro and Bates, 2003]) can be used to allow the variance to change either as a function of the mean or as a function of time, and can also be used to incorporate correlation between responses to potentially address violations of the independence and constant variance assumptions. Another option is to relax the linearity assumption for the trend but retain the independence, normality and equal variance assumptions, at least initially. This is possible using a nonparametric trend estimate such as those based on local polynomial regression, kernel smoothing or spline based methods. These methods assume that the trend is varying slowly over time but estimate both the level of smoothness and the trend simultaneously. These models can be compared with models with either a linear trend or no trend, with that comparison providing information about the form of the trend (nonlinear, linear, or not present). Incorporation of additional variables is most easily accomplished using what is called a
generalized additive model (GAM, [Hastie and Tibshirani, 1990]), which incorporates nonparametric components. Accommodation of certain types of violations of normality is possible using the generalized component of GAMs, where response distributions other than the normal are possible. While GAMs are not technically robust to outliers, if a skewed response distribution is needed, then a gamma distribution can be used. Adjustments for lack of independence and heteroskedasticity are possible using generalized least squares modifications of GAMs, leading to generalized additive mixed models (GAMMs; see [Wood, 2006] for a general discussion). Across the different methods, there are different levels at which the analysis can be considered: the site specific analyses, which amount to single time series analyses, and different levels of regional aggregation, seeking an overall change on a larger scale by aggregating that information. Much recent research has followed the detection and attribution methods as in [Berliner et al., 2000] or, recently, in [Barnett et al., 2008]. The detection part of the analysis, whether Bayesian or frequentist in perspective, amounts to assessing evidence of change, often after incorporating an adjustment for temporal and spatial dependency in the measurements. Researchers such as McCabe and Clark [2005] and Stewart et al. [2005] focus on site specific versions of these methods using simple linear regression or Theil-Sen versions of simple linear regression, in contrast to Berliner et al. [2000], who perform this type of "detection" analysis on the regional scale, accounting for spatial-temporal correlations. After evidence of change has been established, attempts are then made to attribute the change to anthropogenic factors [Barnett et al., 2008]. We focus on the first step in these types of analyses, using a "detection" type model and focusing on estimation of the trend adjusted for spatial-temporal correlations. We do not consider "attribution" in this work, concentrating on the evidence for and estimation of the trends over time. To do this, three types of models are compared: a model with a nonparametric trend, one with a linear trend, and one with no time trend. We also incorporate the effect of totalQ in each model to adjust for the total yearly discharge, and search amongst models that incorporate effects of PDO and ENSO to see if the support for the trend remains in the presence of those regional forcing functions. This provides further information on the form of the trend adjusted for other factors that are thought to be important and not directly related to warming. Whenever an effect is considered to be nonlinear in our models, a similar model constraining it to be linear is also considered, to allow support for the nonlinearity of that component to be assessed. Our model comparison uses a different approach than methods based on hypothesis testing. This avoids some of the arbitrariness of p-value usage and some methodological problems that have yet to be resolved with hypothesis testing in additive mixed models [Wood, 2006]. Instead, we take an information theoretic, model selection approach using Akaike's Information Criterion (AIC, [Akaike, 1973]) to compare support for different potential models. The AIC = −2 log(Likelihood) + 2p, where p is based on the degrees
of freedom for the model, is an estimator of the expected Kullback-Leibler discrepancy. It provides a method of comparing models on their predictive ability, and models with smaller AIC values are considered to be better. Burnham and Anderson [2002] suggest a rule of thumb that models within 2 AIC units of each other should be considered similarly supported. They note that the performance of model selection criteria like AIC improves if a set of reasonable candidate models is defined a priori, instead of "data-dredging" in the selection process. Incorporating models in the candidate set with different versions of the trend (none, linear, nonparametric) allows evidence to be compared for each of the forms of the trends in the presence of other variables. Our approach follows this two step process, identifying a set of candidate models and then ranking those models based on their AIC values relative to the top performing model in the candidate set. This represents a departure from the typical "detection" methods that rely on hypothesis testing, but evidential support for different models can be just as convincing. Any statistical procedure needs to be assessed relative to the assumptions contained in it, even robust or nonparametric procedures. Our models assume normally distributed errors and that the models resemble the real system in terms of the form of the trend and other explanatory variables, as well as the correlation structure. These assumptions can be evaluated using residual diagnostics, which were generally reasonable. Residual temporal autocorrelation was negligible, although it was in the original data set as well. The distribution of the residuals did not suggest mis-specification of the variables in the favored model or a particularly skewed or outlier prone response distribution. However, the tails are slightly heavier than expected based on a normal distribution. A relatively symmetric but heavier tailed error distribution does not suggest that there is a problem with the slope estimates, but could lead to an underestimate of the standard error of the effects. However, the sample size is rather large with n = 1155, which allows the central limit theorem to provide protection on the overall model inference that assumes normality for the errors. If the residuals demonstrated an extremely skewed distribution with outliers in a single direction, the central limit theorem justification for the results might be more debatable.
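The AIC comparison itself is simple to reproduce. Below is a minimal Python sketch, under the assumption of Gaussian errors and least squares fits, comparing no-trend, linear-trend, and quadratic-trend models of a series; it stands in for the GAMM comparisons in the text, which require penalized spline machinery, and all names are ours.

    import numpy as np

    def gaussian_aic(y, fitted, p):
        """AIC = -2 log(Likelihood) + 2p for a Gaussian least squares fit;
        equals n*log(RSS/n) + 2p up to an additive constant."""
        n = len(y)
        rss = np.sum((np.asarray(y) - np.asarray(fitted)) ** 2)
        return n * np.log(rss / n) + 2 * p

    def trend_aics(t, y):
        """Compare no-trend, linear and quadratic trend models by AIC."""
        t = np.asarray(t, dtype=float)
        y = np.asarray(y, dtype=float)
        aics = {}
        for name, deg in [("none", 0), ("linear", 1), ("quadratic", 2)]:
            X = np.vander(t - t.mean(), deg + 1)   # polynomial design matrix
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            aics[name] = gaussian_aic(y, X @ beta, deg + 2)  # +1 for sigma^2
        return aics

Following the Burnham and Anderson rule of thumb, models whose AIC values fall within about 2 units of the smallest would be reported as similarly supported.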
5 ADDITIVE MIXED MODELS
Additive mixed models (or semi-parametric mixed models) provide a general framework for estimating potentially nonlinear regression components as well as accommodating spatial-temporal correlations. This description is not explicit, as it is often meant to encompass models with random components such as random intercepts and random slopes, but it can also refer to models that incorporate variance-covariance structures between observations that are more complicated than the identity matrix. That second use of the terminology is employed here. We did consider random intercept and other related models initially, but found much better overall performance from the correlated errors additive models discussed here. When considering GAM or GAMM models, there are different ways to construct
the nonparametric, smooth components. We use penalized natural cubic regression splines, which are piecewise cubic polynomials that are patched together at "knots", with conditions requiring continuous second derivatives at the knots, linearity conditions at the edges, and estimated roughness penalties on the form of the effects. Natural splines are used to reduce edge effects by imposing linearity constraints at the edges of the effect. By using cubic regression splines, we mean that there is some initial smoothing produced via knot choice, with fewer knots than observation locations. This contrasts with "smoothing splines", where each observation location defines a knot [Ramsay and Silverman, 2005]. Initial screening was actually performed with smoothing splines, but regression splines were used once a typical range of complexity was identified, to improve the convergence of the estimation algorithm. The results were similar using either technique and the cubic regression splines are much faster to estimate. In the estimation of the spline-based component, the knots define an initial restriction on the component, with k knots leading to k − 1 degrees of freedom. Using generalized cross-validation (GCV, [Craven and Wahba, 1979]), a smoothing parameter is chosen that can further reduce the roughness in the spline component. Based on the original knots and the estimated smoothing parameter, each additive effect can be described by its effective degrees of freedom (EDF). As the EDF increases, the freedom in that spline effect increases, starting with an EDF of 1 corresponding to a linear effect. It is important to note that when the smoothing parameter is being estimated, it is possible for the estimation algorithm to choose to make the effect linear; thus a parametric entry for that effect in the model is nested in the GCV estimation process. Wood [2006] describes the multiple generalized cross validation methods used to estimate more than one nonparametric component in a GAM. This algorithm differs from the original back-fitting method developed for GAMs in Hastie and Tibshirani [1990]. An additive mixed model as we employ it here can be written as

yit = α + x1it β1 + · · · + xkit βk + s1(x(k+1)it) + s2(x(k+2)it) + · · · + vit

for i = 1, . . . , 21 locations, t = 1951, . . . , 2005, and v ∼ N(0, σ²Λ), where s(·) means that the term is estimated nonparametrically. Many different correlation structures, including independence between observations, can be specified in Λ. Cressie and Huang [1999] define general classes of separable and non-separable spatial-temporal error structures, which are written as a function of the distance apart in space, d, and time, t. The non-separable error structures allow an interaction between space and time, meaning that the spatial dependency can change over time (or vice-versa). One such structure is the non-separable exponential, where the correlations needed to define Λ are a function of d and t as cor(d, t) = exp(−d/rd − t/rt + λ(d/rd)(t/rt)). This structure becomes separable if λ = 0. Separable structures multiply the spatial correlation by the temporal correlation, providing the same spatial correlation within all years and the same temporal correlation within all locations. They are simpler to estimate than the non-separable structures, which has led to the frequent use of separable structures, as in [Berliner et al., 2000].
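To make the GCV mechanism concrete, here is a simplified Python sketch of a penalized cubic regression spline with the smoothing parameter chosen by GCV. It uses a truncated power basis and penalizes only the knot coefficients, which is simpler than the natural cubic basis and curvature penalty used in the chapter (so its smoothest limit is a global cubic rather than a line); all names are illustrative.

    import numpy as np

    def cubic_basis(x, knots):
        # Truncated power basis: columns 1, x, x^2, x^3, then (x-k)_+^3
        # for each interior knot.
        cols = [np.ones_like(x), x, x ** 2, x ** 3]
        cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
        return np.column_stack(cols)

    def gcv_spline(x, y, knots, lambdas):
        # Ridge-penalize the knot coefficients and choose the smoothing
        # parameter by GCV(lambda) = n * RSS / (n - tr(H))^2, where H is
        # the hat matrix and tr(H) is the effective degrees of freedom.
        B = cubic_basis(np.asarray(x, dtype=float), knots)
        n = len(y)
        pen = np.diag([0.0] * 4 + [1.0] * len(knots))
        best = None
        for lam in lambdas:
            H = B @ np.linalg.solve(B.T @ B + lam * pen, B.T)
            fitted = H @ y
            rss = float(np.sum((y - fitted) ** 2))
            edf = float(np.trace(H))
            gcv = n * rss / (n - edf) ** 2
            if best is None or gcv < best["gcv"]:
                best = {"gcv": gcv, "lambda": lam,
                        "fitted": fitted, "edf": edf}
        return best

The non-separable exponential correlation given above also translates directly:

    def nonseparable_exp_cor(d, t, r_d, r_t, lam=0.0):
        # cor(d, t) = exp(-d/r_d - t/r_t + lam*(d/r_d)*(t/r_t));
        # lam = 0 recovers the separable exponential structure.
        return np.exp(-d / r_d - t / r_t + lam * (d / r_d) * (t / r_t))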
Especially when considering the separable spatial-temporal errors, there are a variety of different candidate models for both the spatial and temporal components in the correlation, as any spatial and temporal method can easily be combined. We found the linear spatial correlation structure to be most numerically stable in this application. It is one of the simplest correlation structures, with a linear decrease up to the spatial range, Rd, and then correlations of 0 beyond that point, calculated as cor(d) = [1 − (d/Rd)] I(d < Rd). For φ > 0, the AR(1) error structure is equivalent to the exponential structure discussed above with φ = e^(−1/rt). However, we routinely encountered negative temporal autocorrelation, so this constraint would be problematic in this application. Putting the linear spatial and AR(1) temporal correlation together as a separable structure leads to the linear-AR(1) spatial-temporal structure with cor(d, t) = [1 − (d/Rd)] I(d < Rd) φ^t.

PROBABILITY IN ANCIENT INDIA

C. K. Raju

Given ε > 0, the process checked whether there was an N beyond which |Σn=N..N+m an| < ε. The supertask of actually summing the residual partial sums for all numbers m could obviously not be carried out except in some special cases (such as that of the geometric series). However, if S(k) = Σn=1..k an denotes the partial sums, the usual process was to check whether the partial sums became "constant", beyond some N.[35] Obviously, the partial sums S(k) would never literally become constant, and when successive terms are added, there would always be some change (except when all the terms of the series are zero beyond N, so that the infinite sum reduces to a finite sum). So this "constancy" or "no change" was understood to hold only up to the given level of precision (ε) in use. That is, the sum of an infinite series was regarded as meaningful if the partial sums S(k) became constant, after a stage, up to a non-representable (or discardable) quantity: |S(N + m) − S(N)| < ε, which is just the criterion stated earlier. What exactly constitutes a "non-representable" or "discardable" quantity (ε) is context-dependent, decided by the level of precision required, and there need be no "universal" or mechanical rule for it. Apart from the question of convergence, a key philosophical issue which has gone unnoticed relates to representability. The decimal expansion of a real number, such as π, also corresponds to an infinite series. Regardless of the convergence of this series, it can be written only up to a given number of terms (corresponding to a given level of precision): even writing down the terms in the infinite decimal expansion is a supertask; this is obvious enough when there is no rule to predict what the successive terms would be. So a real number such as π can never be accurately represented. Indian tradition took note of this difficulty from the earliest times, with the śulba sūtra-s (−500 CE or earlier) using the words[36] saviśeṣa ("with something left out") or[37] sānitya (sa + anitya = "impermanent, inexact"), and early Jain works (such as the Sūrya prajñapti, sūtra 20) also use the term kiñcid viśeṣādhika ("a little excess") in describing the value of π and √2.

Buddhism and Science, Nalanda, 2008. Draft at http://ckraju.net/papers/Zeroism-and-calculus-without-limits.pdf.
33 Āryabhaṭīya of Āryabhaṭācārya with the Bhāṣya of Nīlakaṇṭhasomasutvan, ed. K. Sambasiva Sastri, University of Kerala, Trivandrum, 1930, reprint 1970, commentary on Gaṇita 17, p. 142.
34 C. K. Raju, Cultural Foundations of Mathematics, cited above, chp. 3.
35 ibid., pp. 177–78, and e.g., Kriyākramakarī, cited above, p. 386.
36 Baudhāyana śulba sūtra, 2.12. S. N. Sen and A. K. Bag, The Śulbasūtras of Baudhāyana, Āpastamba, Kātyāyana and Mānava, INSA, New Delhi, 1983, p. 169.
37 Āpastamba śulba sūtra, 3.2. Sen and Bag, cited above, p. 103. The same thing is repeated in other śulba sūtra-s.
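The "constancy up to ε" criterion described above is mechanical enough to state in code. A small Python sketch (ours, purely illustrative) applies it to the geometric series Σ 1/2^n:

    def constant_up_to_eps(terms, eps, m_max=1000):
        """Find the first N at which the partial sums stop changing by
        more than eps, i.e. |S(N+m) - S(N)| < eps for m = 1..m_max."""
        s, partial = 0.0, []
        for a in terms:
            s += a
            partial.append(s)
        for N in range(len(partial) - m_max):
            if all(abs(partial[N + m] - partial[N]) < eps
                   for m in range(1, m_max + 1)):
                return N
        return None

    terms = [0.5 ** n for n in range(2000)]      # geometric series, sum = 2
    print(constant_up_to_eps(terms, eps=1e-6))   # prints 20 for eps = 1e-6

Choosing a different ε, i.e. a different working level of precision, changes N, which is exactly the context-dependence the text emphasizes.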
Āryabhaṭa (5th c. CE) used the word āsanna[38] (near value), which term is nicely explained by Nīlakaṇṭha in his commentary,[39] essentially saying that the "real value" (vāstavī saṃkhyā) cannot be given. Taking cognizance of this element of non-representability fundamentally changes arithmetic. This happens, for example, in present-day computer arithmetic, where one is forced to take into account this element of non-representability, for only a finite set of numbers can be represented on a computer. Consequently, even integer arithmetic on a computer can never obey the rules of Peano's arithmetic. In the case of real numbers, or floating point computer arithmetic, of course, a mechanical rule is indeed set up for rounding (for instance in the IEEE floating point standard 754 of 1986), and this means that addition in floating point arithmetic is not an associative operation,[40] so that floating point arithmetic would never agree with the arithmetic according to any standard formal algebraic structure such as a ring, integral domain, field etc.

In Indian tradition, this difficulty of representation connects to a much deeper philosophy of śūnyavāda. On the Buddhist account of the world, the world evolves according to "conditioned coorigination". (A precise quantitative account of what this phrase means to me, and how this relates to current physics, is a bit technical, and is available in the literature for those interested in it.[41]) The key point is that there is genuine novelty of the sort that would surprise even God, if he existed. There is no rigid linkage (no Newton's "laws") between present and past; the present is not implicit in the past (and cannot be calculated from knowledge of the past, even by Laplace's demon). Accordingly, there is genuine change; nothing stays constant. But how does one represent a non-constant, continually changing entity? Note that, on Buddhist thought, this problem applies to any entity, for Buddhists believe nothing real can exist unchanged or constant for two instants, so there is no constant entity whatsoever which is permanent or persists unchanged. This creates a difficulty even with the most common utterances, such as the statement "when I was a boy", for I have changed since I was a boy, and now have a different size, gray hair etc. The linguistic representation however suggests that underlying these changes there is something constant, the "I" to which these changes happen. Buddhists, however, denied the existence of any constant, unchanging essence or soul, for it was neither empirically manifest, nor could it be inferred: the boy and I are really two different individuals with some common memories.

38 Gaṇita 10, trans. K. S. Shukla, cited earlier.
39 Āryabhaṭīya bhāṣya, commentary on Gaṇita 10, ed. Sambasiva Sastry, cited earlier, p. 56.
40 For an example of how this happens, see C. K. Raju, "Computers, mathematics education, and the alternative epistemology of the calculus in the Yuktibhasa", Philosophy East and West, 51(3) 2001, pp. 325–62.
41 The idea is to use functional differential equations of mixed-type to represent physical time evolution. This leads to spontaneity. See C. K. Raju, Time: Towards a Consistent Theory, Kluwer Academic, 1994, chp. 5b. Also, C. K. Raju, "Time Travel and the Reality of Spontaneity", Foundations of Physics 36 (2006) pp. 1099–1113.
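The non-associativity of floating point addition mentioned above is easy to exhibit; this tiny Python example (ours) uses IEEE 754 double precision:

    # Associativity fails in IEEE 754 double precision arithmetic:
    print((0.1 + 1e20) - 1e20)   # 0.0 -- the small term is absorbed
    print(0.1 + (1e20 - 1e20))   # 0.1 -- this grouping preserves it

So (a + b) + c and a + (b + c) differ for a = 0.1, b = 1e20, c = -1e20, which is why floating point numbers cannot form a ring or field.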
However, while Buddhists accept the reality of impermanence, there is a practical problem of representation in giving a unique name to each individual at each instant. Consider, for example, Ashoka. No one, not even the Buddhists, describes him as Ashoka1, Ashoka2, and so on, with one number for each instant of his life, which cumbersome nomenclature would require some billion different names even on the gross measure of one second as an atomic instant of time. Therefore, for practical purposes, Buddhists recognize the paucity of names, and still use a single name to represent a whole procession of individuals. This "constancy" of the representation is implicitly understood in the same sense as the constancy of the partial sums of an infinite series: namely, one neglects some small differences as irrelevant to the context. That is, on the Buddhist view of constant change, the customary representation of an individual, used in everyday parlance, as in the statement "when I was a boy", can be obtained only by neglecting the changes involved (my size, my gray hair, etc.) as inconsequential or irrelevant in the context, and these changes are hence discarded as "non-representable" (for the practical purpose of mundane conversation, in natural language). So, from the Buddhist perspective of impermanence, mundane linguistic usage necessarily involves such neglect of "inconsequential" things, no matter what one wants to talk about. Note the contrast with the idealistic Platonic and Neoplatonic belief. Plato and Neoplatonists believed in the existence of ideal and unchanging or eternal and constant entities (soul, mathematical truths). Within this idealistic frame, mundane linguistic usage (as in the statement "when I was a boy") admits a simple justification in the straightforward sense that change happens to some underlying constant or ideal entity. But this possibility is not available within Buddhism, which regards such underlying ideal entities as fictitious and erroneous, and can, therefore, only speak about non-constant entities as if they were constant. The dot on the piece of paper is all we have; it is the idealization of a geometric point which is erroneous. (Apart from the idealist position, the formalist perspective of set theory also fails, for Buddhist logic is not two valued. But I have dealt with this matter in detail elsewhere, and we will see this in more detail below.) Thus, śūnyavāda or zeroism provides a new way to get over the "fear of small numbers". It was, I believe, Borel who raised the question of the meaning of small numbers such as 10^−200. On the śūnyavāda perspective, we can discard such numbers as practically convenient. (We have nothing better, no "ideal" or "perfect" way of doing things.) We are not obliged to give a general or universal rule for this, though we can adopt convenient practices. What this amounts to is a realist and fallibilist position. All knowledge (including mathematical knowledge) is fallible.[42] Therefore, when given an excessively small number we may discard it, as in customary practice, or in computer arithmetic.

42 If mathematical proof is treated as fallible, the criterion of falsifiability would need modification. When a theory fails a test, it is no longer clear what has been refuted: (a) the hypothesis or (b) the deduction connecting hypothesis to consequences. C. K. Raju, "Proofs and refutations in mathematics and physics", in: History and Philosophy of Science, ed. P. K. Sen, PHISPC (to appear).
(Unlike computer arithmetic, where one requires a rule, with human arithmetic we can allow the "excessive smallness" of the number to be determined by the context.) It is possible that this leads to a wrong decision. If enough evidence accumulates to the contrary, we revise our decision. It is the search for immutable and eternal truths that has to be abandoned. Such eternal truths are appropriate to religion, not to any kind of science. Thus the traditional Indian understanding of mathematics, using zeroism, dispenses with the need for convergence, limits, or supertasks, and rehabilitates the frequentist interpretation of probability, in the sense that it provides a fresh answer to a long-standing philosophical difficulty in the Western tradition.
SUBJECTIVE PROBABILITIES AND THE UNDERLYING LOGIC OF SENTENCES
Probabilities of singular events

Of course, there are other problems with the frequentist interpretation: for example, it does not apply to single events, for which one might want to speak of probability. The classic example is that of a single footprint on a deserted beach (or the origin of life). There is some probability, of course, that someone came in a helicopter and left that single footprint just to mystify philosophers. But, normally, one would regard it as a natural phenomenon and seek a natural explanation for it. In this context there is an amusing account from the Indian Lokāyata tradition, which is the counterpart of the Epicurean perspective in Greek tradition. Here, a man seeking to convert his girlfriend to his philosophical perspective goes about at night carrying a pair of wolf's paws. He makes footprints with these paws. His aim is to demonstrate the fallibility of inference. He argues that, by looking at the footprints, learned people will infer that a wolf was around, and they will be wrong. (We recall that the Lokāyata believed that the only reliable principle of proof was the empirically manifest.) More seriously, such singular events pose a serious problem today in quantum mechanics, where the "probability interpretation of the wave function" is called into play to explain the interference of probabilities exhibited by single objects. A typical illustration of such interference is the two-slit diffraction pattern that is observed even when it is practically assured that electrons are passing through the slit one at a time. Understanding the nature of quantum probabilities has become a major philosophical problem, and we describe below some attempts that have been made to understand this problem by connecting it to philosophies and logics prevalent in ancient Indian tradition.
Quantum mechanics, Boolean algebra, and the logic of propositions

In the 1950s there was a novel attempt to connect the foundations of probability theory to the Jain logic of syādavāda, by three influential academicians from India: P. C. Mahalanobis,[43] founder of the Indian Statistical Institute, J. B. S. Haldane,[44] who had moved to that institution, and D. S. Kothari,[45] Chairman of the University Grants Commission. Subsequently, the quasi truth-functional logic used in the structured-time interpretation of quantum mechanics[46] was connected to Buddhist logic.[47] To understand these attempts, first of all, let us connect them to the more common (Kolmogorov) understanding of probability as a positive measure of total mass 1 defined on a Boolean σ-algebra (usually of Borel sets of a topological space). The common definition typically requires set theory, as we saw above, to facilitate the various supertasks that are required, whether for the construction of formal real numbers as Dedekind cuts, or as equivalence classes of Cauchy sequences, or for the notion of convergence required by the law of large numbers. However, from a philosophical perspective it is more convenient to use statements instead of sets (though the two are obviously interconnected). Thus, instead of defining probabilities over measurable sets, it is more natural to define probabilities over a Boolean algebra of statements. This, incidentally, suits the subjectivist interpretation, for the probability of a statement could then be taken to indicate the degree of (general) subjective belief in that statement (or the objective propensity of that statement to be true, whatever that means). The immediate question, however, is that of the algebraic structure formed by these statements. First of all, we can set aside the specifically σ-algebra aspect, for we have already dealt with the notion of convergence and supertasks above. For the purposes of this section we will focus on the Boolean algebra part. Why should probability be defined over a Boolean algebra? The answer is obviously that if we have a 2-valued logic of sentences, then a Boolean algebra is what we naturally get from the usual notions of "and", "or", "not", which are used to define the respective set-theoretic operations of intersection, union and complementation. What is not obvious is why these "usual" notions should be used, or why logic should be 2-valued. Quantum mechanics (and especially the problem of the probabilities of singular events in it) provides a specific empirical reason to call the Boolean algebra into question. With probabilities defined on a Boolean algebra, joint distributions of random variables are assured to exist.

43 P. C. Mahalanobis, 'The Foundations of Statistics (A Study in Jaina Logic)', Dialectica 8, 1954, pp. 95–111; reproduced in Sankhya, Indian Journal of Statistics, 18, 1957, pp. 183–94.
44 J. B. S. Haldane, 'The Syādavāda System of Predication', Sankhya, Indian Journal of Statistics, 18, 1957, pp. 195–200.
45 D. S. Kothari, 'Modern Physics and Syādavāda', Appendix IV D in Formation of the Theoretical Fundamentals of Natural Science, vol. 2 of History of Science and Technology in Ancient India, by D. P. Chattopadhyaya, Firma KLM, Calcutta, 1991, pp. 441–48.
46 C. K. Raju, Time: Towards a Consistent Theory, cited above, chp. 6B, "Quantum Mechanical Time".
47 C. K. Raju, The Eleven Pictures of Time, Sage, 2003.
This is, however, known not to happen in quantum mechanics. (We will not go into details, since our primary concern here is with Indian tradition, and not quantum mechanics. However, this author has explained the detailed relation to quantum mechanics elsewhere, at both a technical[48] and a non-technical level.[49]) The Hilbert space formulation of quantum mechanics starts with the premise that quantum probabilities cannot be defined on a Boolean algebra, since joint distributions do not exist. The appropriate algebraic structure is taken to be that of the lattice of subspaces (or projections) of a Hilbert space (although there are numerous other opinions about what the exact algebraic structure ought to be). The usual definition of a random variable as a measurable function actually requires only the inverse function, which is a homomorphism which preserves the algebraic structure. In the Hilbert space context, this definition of a random variable as a homomorphism (on a lattice, not an algebra) naturally leads one to identify random variables with projection-valued measures (spectral measures). By the spectral theorem, such measures correspond to densely-defined, self-adjoint operators in this Hilbert space. Since the lattice of projections is non-distributive, these random variables do not admit joint distributions. This corresponds to the more common assertion (the "uncertainty principle") that dynamical (random) variables (self-adjoint operators) which do not commute cannot be simultaneously measured. To return to the question of logic: unlike in India, where different types of logic have been in existence for over 2500 years, from pre-Buddhist times,[50] the West took cognizance of the existence of logics that are not 2-valued only from the 1930s onwards, starting with Łukasiewicz, who proposed a 3-valued logic where the truth values could be interpreted as "true", "false", and "indeterminate". Could such a 3-valued logic account for quantum probabilities? This question was first investigated by Reichenbach, in an unsuccessful interpretation of quantum mechanics. The 3 Indian academics mentioned above also interpreted the Jain logic of syādavāda (perhaps-ism) as a 3-valued logic (Haldane), and explored 3-valued logic as a philosophical basis for formulating probabilities (Mahalanobis) and for interpreting quantum mechanics (Kothari). Haldane's idea related to perception. With repeated experiments, something on the threshold of perception (such as a sound) may be perceptible sometimes, and sometimes not. In such cases, the "indeterminate" truth value should be assigned to the statement that the "something" is perceptible. Mahalanobis' idea was that this third truth value was already a rudimentary kind of probability, for it expressed the notion of "perhaps". Kothari's idea was to try and explain quantum mechanics on that basis (though he overlooks Reichenbach's earlier unsuccessful attempt).

48 C. K. Raju, Time: Towards a Consistent Theory, cited above, chp. 6B, "Quantum Mechanical Time".
49 C. K. Raju, The Eleven Pictures of Time, Sage, 2003.
50 B. M. Baruah, A History of Pre-Buddhistic Indian Philosophy, cited above.
Buddhist and quasi truth-functional logic

While Haldane's interpretation is clear enough within itself, it is not clear that it accurately captures the logic of syādavāda. Thus, the Jain tradition grew in the vicinity of the Buddhist tradition (the Buddha and Mahavira were contemporaries). However, Buddhist logic is not 3-valued. For example, in the Dīgha Nikāya, the Buddha asserts the existence of 4 alternatives (catuṣkoṭi): (1) the world is finite; (2) the world is not finite (= infinite); (3) the world is both finite and infinite; and (4) the world is neither finite nor infinite.[51] This logic of 4 alternatives does not readily fit into a multi-valued truth-functional framework. Especially, the third alternative, which is of the form A ∧ ¬A, is a contradiction within 2-valued logic, and difficult to understand even within the frame of 3-valued logic, where it cannot ever be "true". The reason why 3-valued logic is not appropriate for quantum probabilities is roughly this: in the case of the two-slit experiment, what is being asserted is that it is true that the electron passed through both slit A and slit B, and not that in reality it passed through only one slit, but we do not know which slit it passed through. What is being asserted is that we know that Schrödinger's cat is both alive and dead, as in the third alternative above, and not that it is either alive or dead, but we do not know which is the case. However, the 3rd alternative of the Buddhist logic of 4 alternatives (catuṣkoṭi) makes perfect sense with a quasi truth-functional logic. The standard semantics here uses the Tarski-Wittgenstein notion of a logical "world", as "all that is the case". On this "possible-world semantics" one assigns truth values (either true or false) to all atomic statements: such an assignment of truth values represents the possible facts of the world (at one instant of time), or a "possible world". This enables the interpretation of modal notions such as possibility and necessity: a statement is "possible" if it is true in some possible worlds, and "necessary" if it is true in all possible worlds (tautology). In fact, Haldane appeals to precisely this sort of semantics in his interpretation of Jain logic, except that his "worlds" are chronologically sequential. Thus, A is true at one instant of time, while not-A is true at another instant of time; there is nothing paradoxical about a cat which is alive now, and dead a while later. However, this, as we have observed, is not appropriate to model the situation depicted by quantum mechanics. With quantum mechanics what we require are multiple logical worlds attached to a single instant of time. Parallel computing provides a simple and concrete desktop model of this situation, with each processor represented by a separate logical world. The meaningfulness of a quasi truth-functional logic is readily grasped in this situation where multiple (logical) worlds are chronologically simultaneous and not sequential, for this allows a statement to be simultaneously both true and false. That is, with multiple (2-valued) logical worlds attached to a single instant of time, it is meaningful to say that A is true in one world while ¬A is simultaneously true in another.

51 Brahmajāla sutta of the Dīgha Nikāya. (Hindi trans.) Rahul Sānkrityāyana and Jagdish Kāshyapa, Delhi: Parammitra Prakashan, 2000, pp. 8–9; (English trans.) Maurice Walshe, Boston: Wisdom Publication, 1995, pp. 80–81. For a more detailed exposition, see C. K. Raju, "Logic", in Encyclopedia of Non-Western Science, Technology and Medicine, Springer, 2008.
From our immediate perspective, the important thing is this: such a quasi truth-functional logic leads, on the one hand, to an algebraic structure appropriate to quantum probabilities, a structure which is not a Boolean algebra.52 On the other hand, Buddhist logic (catuṣkoṭi) naturally admits an interpretation as a quasi truth-functional logic. Thus, Buddhist logic (understood as quasi truth-functional) leads to just the sort of probabilities that seem to be required by quantum mechanics.

This quasi truth-functional logic, corresponding to simultaneous multiple worlds, is not a mere artificial and post-facto construct imposed on either quantum mechanics or Buddhist thought. From the viewpoint of physics, quasi truth-functional logic arises naturally by considering the nature of time. This is best understood through history. Hoping to make "rigorous" the imported calculus, and the notion of derivative with respect to time required for his "laws", Newton made time metaphysical ("absolute, true, and mathematical time" which "flows equably without relation to anything external"53). Eventually, this intrusion of metaphysics and religious belief into physics had to be eliminated from physics through a revised physical definition of the measure of time; that led directly to the special theory of relativity.54 A correct mathematical understanding of relativity shows that physical time evolution must be described by functional differential equations (and not ordinary or partial differential equations). The further elimination of the theological understanding of causality in physics makes these functional differential equations of mixed type. The resulting picture of physical time evolution55 is remarkably similar to the core Buddhist notion of "conditioned coorigination": the future is conditioned by the past, but not decided by it. There is genuine novelty. Thus, the relation of the quasi truth-functional logic to the revised notion of time in physics parallels the relation of Buddhist logic to the Buddhist notion of "conditioned coorigination" (paṭicca samuppāda). Note that this last notion differs from the common notion of "causality" used in Western thought, with which it is commonly confounded.
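To indicate schematically what the mixed-type equations mentioned above involve (the following display is an illustration supplied here, not the chapter's own formulas): an ordinary differential equation, ẋ(t) = f(x(t)), lets the present state alone decide the future; a retarded functional differential equation, ẋ(t) = f(x(t), x(t − τ)), lets past states enter as well; and a mixed-type functional differential equation, ẋ(t) = f(x(t), x(t − τ), x(t + τ)), involves both retarded and advanced arguments, so that the past conditions the evolution without mechanically deciding it.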
Of course, formal Western mathematics (and indeed much of Western philosophy) is likely to be a long-term casualty of any departure from 2-valued logic. In fact, the very idea that logic (or the basis of probability) is not culturally universal, and may not be empirically certain, unsettles a large segment of Western thought, and its traditional beliefs about induction and deduction.

51 Brahmajāla sutta of the Dīgha Nikāya. (Hindi trans.) Rahul Sāṅkrityāyana and Jagdish Kāshyapa, Delhi: Parammitra Prakashan, 2000, pp. 8–9; (English trans.) Maurice Walshe, Boston: Wisdom Publications, 1995, pp. 80–81. For a more detailed exposition, see C. K. Raju, "Logic", in Encyclopedia of Non-Western Science, Technology and Medicine, Springer, 2008.
52 See the main theorem in C. K. Raju, "Quantum Mechanical Time", cited above.
53 I. Newton, The Mathematical Principles of Natural Philosophy, A. Motte's translation revised by Florian Cajori, Encyclopedia Britannica, Chicago, 1996, p. 8.
54 For an exposition of Poincaré's philosophical analysis of the notion of time which led to the special theory of relativity, see C. K. Raju, "Einstein's time", Physics Education (India), 8(4) (1992), pp. 293–305. A proper clock was defined by postulating the velocity of light to be a constant. This had nothing to do with any experiment. See also C. K. Raju, "The Michelson-Morley experiment", Physics Education (India), 8(3) (1991), pp. 193–200.
55 C. K. Raju, "Time travel and the reality of spontaneity", Found. Phys. 36 (2006), pp. 1099–1113.
INDEX
1-, 2-, n-random, see randomness A-recursion, 461 a posteriori, 901 abbreviation, 955 Abhandlunguen de Fries’schen Schule, 1159n Abramovich, F., 602 absolutely normal, 658 acceptability, 474 acceptance rules, 475 acceptance-based inference, 474, 475, 484 accommodates, 984 accuracy, 987, 991–993, 997 accurate prediction, 991, 997 Achinstein, P., 420 acronym, 955 action, 268, 269 acyclic linear causal model, 1011n ad hoc, 984 ad hoc hypothesis, 395 ad hoc grammar, 868 Adams’ thesis, 127 Adams’s principle, 465 Adams, E. W., 127, 128, 131, 465 addition rule, 80 additive mixed models, 1131, 1132, 1138, 1142 additivity, 406 adjunction, 1108 adjustable, 989 adjustable parameters, 984, 988 adjusted, 989 admissibility, 292 admissible, 1008 advantage, 299 advantage of rejecting, 299
AFL, 921, 922 AIC, 2, 3, 16, 17, 18, 24, 25, 43, 45, 110, 426, 535, 541, 545, 583, 584–586, 588–590, 592–594, 598–600, 602, 603, 889, 896, 932–934, 946, 994, 996, 1131, 1132, 1135–1138 Airy, G., 1162n Akaike, H., 535, 541, 583, 592, 593 Albert, J. H., 481 algorithm, 645, 645n, 661–663, 665, 667, 676–677, 681, 684, 687, 690, 706, 708 algorithmic complexity, see complexity information theory, 648n probability, 648n, 676, 683, 688 randomness, see randomness algorithmic complexity, 912, 913 algorithmic information theory, 902, 903, 910, 938, 946, 947, 955 algorithmic probability, 912, 939 algorithmic randomness, 34, 38 alien intelligence, 958 Allais paradox, 734 almost sure convergence result, 112 alternative hypothesis, 87 amount of information, 270, 284, 285 analogical method for many properties, 481 analogical methods, 479 analogy, 762 analogy by proximity, 481 analogy by similarity, 479 analytical hierarchy, 662, 696 ancient India, 43 ancillarity, 819 ANOVA, 588
1198 anti-realist attitude, 986 approximation to the truth, 484 arbitrary functions method of, 1169 argument from coincidence, 163 arithmetic coding, 876 arithmetical hierarchy, 662, 694, 707 Armendt, B., 123 Arntzenius, F., 122, 127 Ars conjectandi, 1151, 1152, 1157n, 1170 artificial intelligence, 901 ¯ Aryabhat . a, 1179 assertabilities, 127 association, 816 association rules, 1112 associationist psychology, 1151n assumptions, 1128–1130 asymptotic approximation, 282, 283 asymptotic behaviour, 281 asymptotic normality of maximum likelihood estimators, 1048, 1062 attribution, 1131, 1139 Austin, J. L., 1149, 1169, 1170 Australian Football League (AFL), 921, 922 auto-correlation, 1126, 1134, 1135, 1138 autonomous, 119 average error rate, 506 axiom of regularity, 342 axiom of sufficiency, 564 axiomatic basis, 268 axiomatic foundations, 264 axiomatic system, 263 axioms, 269 Bacchus, F., 122 background probability, 145, 146, 148 Bacon, F., 392 Baird, D., 4, 5, 45 ban, 950 Bandyopadhyay, P. S., 32, 44, 92 Bartlett, P., 603 Barwise, J., 461
basic intervals, 649, 651, 667, 668, 673, 691, 693, 704 basic Ockham efficiency theorem, 1006 Bayes estimator, 295 Bayes factor(s), 301, 425, 483, 493n, 584, 594, 607, 608, 612–614, 933, 988 Bayes factor argument, 989, 1000 Bayes factor explanation, 1007 Bayes factor model selection, 889 Bayes’ Information Criterion, see BIC Bayes’ rule, 59, 87, 583, 598, 599, 719, 721 Bayes’ theorem, 103, 139, 234, 255, 272–275, 277, 281, 288, 334– 335, 345, 352–359, 363, 366, 371, 373, 375, 378, 395, 415, 416, 498, 537, 572, 599, 715, 726, 901, 929, 1050 Bayes, T., 392, 448, 455, 752, 770, 772 Bayesian, 14, 25, 263, 272, 500, 541, 557, 561, 572–574, 576, 577, 901, 913, 987, 990, 1027, 1134, 1135, see also Bayes’ theorem inference, 756, 766 logic, 755 statistical inference, 765, 768 statistics, 764, 767 Bayesian agents, 1016 Bayesian approach, 87, 264, 473, 475 Bayesian bias, 945, 946 Bayesian central limit theorem, 1028, 1051, 1052, 1063 Bayesian confirmation, 6 Bayesian confirmation theory, 5, 8, 333–387, 391 Bayesian credence, 1017 Bayesian credible regions, 293 Bayesian epistemology, 307 objective, 307 Bayesian expansion, 1020 Bayesian experimental design, 1020
1199 Bayesian explanation, 989, 991 Bayesian explanation of Ockham’s razor, 988 Bayesian inductive logic, 473, 475 Bayesian inference, 493n, 895 Bayesian MML, 930 Bayesian model average, 603 Bayesian model selection, 25 Bayesian net, 327, 909, 927, 928, 964 Bayesian network(s), 109, 909 Bayesian paradigm, 264, 265, 272, 273, 283, 302 Bayesian posterior, 932 Bayesian posterior probabilities, 517 Bayesian prior(s), 920, 921, 930, 942, 944 Bayesian statistics, 3, 8, 9, 256, 473, 475, 485, 988 Bayesian universal distribution, 879 Bayesian updating, 217, 220 Bayesianism, 2, 5, 8, 11, 45, 103, 217, 233, 234, 307, 558, 715, 716, 913, 948, 949 Bayesians, 510, 535, 635 Bayesian, see degree(s) of belief beg the question, 994 beg the question in favor of simplicity, 990 behavioristic rationale in testing, 157 belief, 266, 514 degree of, 334, 363, 366, 368, 370, see also degree of belief formation, 248 function, 344–345, 365–368, 370– 371 strength, 335, 344–345, 363, 365– 368, 370–371 belief-control, 252 bell-shaped curve, 76 Bennett, D., 34, 38, 45 Bennett, J. G., 92 Berger, J. O., 602 Berkeley, G., 46 Berkson, J., 826, 827
Berksonian bias, 827 Bernardo’s objective Bayesianism, 10 Bernardo, J. M., 9, 45, 583 Bernoulli distribution, 1060 model, 879–882 random variable, 1030, 1035, 1037 Bernoulli trial(s), 80, 271, 286, 1037 Bernoulli, D., 43, 266, 275, 1149, 1152 Bernoulli, J., 441, 447, 1150–1152, 1154, 1157, 1170 Berry paradox, 676, 681, 682 Berry-Esseen theorem, 1030, 1035, 1038, 1052 Bertrand’s box paradox, 255, 713, 714, 716, 718, 719, 722 Bertrand’s paradox, 449, 953 Bertrand, J., 43, 449–451, 1164, 1165, 1168 beta distribution, 275, 1044, 1046, 1051 beta function, 1052 betting argument(s), 1010, 1013 betting quotient, 245 betting strategy capital, see martingales gambling, 646, 654–656, 669–671, 671n, 672, 673, 698, 699, 701, 702, 706 non-montotonic, 701–702 Bhaskara II, 1176, 1177 bias, 546, 825, 990, 992, 993, 1103 bias-variance trade-off, 992 biased design, 503 biasing path, 824, 825 biasing paths, 826 BIC, 24, 25, 110, 425, 536, 583, 584, 586–589, 597–601, 603, 617, 889, 939, 940 BIL, 484 Billingsley, P., 115 binary lambda calculus, 681 binomial distribution, 71, 79, 80, 82, 200, 202, 203, 212, 216, 275,
282, 286, 297, 1030, 1034–1037, 1051 binomial parameter, 1051 binomial random variable, 1035 bionic eye, 965 Birkhoff Ergodic Theorem, 661, 703–704 Blackwell, D., 117, 483 bleen, 955 Blume, J., 11, 12, 14, 19, 45 Bode's law, 1155n Boik, R., 38 Bolzano, B., 444–449, 451, 453, 454, 456, 458, 459 Bonferroni adjustment, 506 Bonferroni method, 1107 Boole, G., 1160, 1171 Boolean algebra, 459, 460 boosting, 950–952 boosting priors, 951 bootstrap, 195 Borel normality, 656–660, 672, 690–692, 704, 706 Borel's strong law, 641, 648, 655–657, 659, 661, 669, 672, 673, 706 of large numbers, 35 Borel, É., 35, 111, 645n, 656, 658, 1167, 1171 Borel-Cantelli lemma, 674 borrow strength, 1075, 1087 Boulton, D. M., 909, 936 boundary rule, 940 breaking strength, 268 Breiman, L., 603 Brier scores, 248 BS, 478, 481, 485 Buffon, G.-L. L., Comte de, 1151n Burckhardt, J., 1168n, 1169 bus number problem, 932, 933, 947 butter, 1102 c†, 454, 455 C. S. Peirce on relative frequency, 155 cable guy paradox, 731
Calas case, 1153 Calcul de probabilit´es, 1164 calculus Descartes’ idiosyncratic reaction, 1184 calibration, 247, 249–251, 304, 322– 325 Calude, C. S., 655 cancer, 1077, 1080 Cantelli, F., 656 Cantor space, 648–649, 651–653, 667, 691, 703, 705 Cardano, G., 634, 1150 cardiac modelling, 966 Carnap Q-predicates, 454 Carnap, R., 8, 99, 118, 307, 315, 335, 364, 370, 396, 441, 453, 455– 459, 463, 473, 476–481, 481n, 751, 752, 756, 761–764, 772, 1150 Carnapian logic, 766 Cartwright, N., 108, 223 Casella, G., 583 catch-up phenomenon, 897 catus.kot.i, 1193, 1194 causal, 958, 959 conclusions, 998 decision theory, 142, 240 diagrams, 822, 828 discovery, 997, 998, 1000 flips, 1012 identification, 817 inference, 25, 26, 28, 813, 816, 820 laws, 816 Markov condition, 109 models, 813 networks, 1012 pathways, 824 systems, 822 theories, 997, 1012 causality, 813, 819, 959, 960 causation, 218, 813, 814 cause and effect, 815
cautious monotony, 310 center of mass, 1123 central limit theorem (CLT), 38, 64, 79, 80, 82, 86, 1027–1029, 1039, 1045, 1061, 1132 chain rule, 881 chaining inferences, 317 Chaitin's omega (Ω), 691–692, 694–696 Chaitin, G. J., 36, 635, 645n, 647, 675, 683, 896, 909, 912 Chakrabarti, A., 24, 25, 45, 592–594 Chalmers, D., 127 Champernowne sequence; Champernowne number, 657–660 chance, 139–142, 146 as species of subjective probability, 1169 characteristic function, 1060, 1062, 1065 Charnigo, R., 38, 41, 46 Cherry, S., 32, 44 Chervonenkis, A., 946 Chevalier de Méré, 441 χ² test, 202, 203, 205, 207 χ² distribution, 1054 Chinese room, 956 Chomsky, N., 866 choosing, 989 Christensen, D., 123 Church's thesis, see Church-Turing thesis Church, A., 35, 36, 645n, 663, 672 Church-Kleene, 462n Church-Turing thesis, 663, 690, 706–708 circular, 1021 arguments, 1000 Claeskens, G., 603 classical interpretation, 106 of probability, 713 classical statisticians, 991, 997 classical statistics, 3, 256 classical/error statistics, 45 classification graphs, 964
classification tree, 946, 950, 964 climate change, 1121, 1122 climate science, 41 clustering, 965 code, 865 book, 937, 938 length function, 875 words, 874 codes and probability distributions, 869 coding prior, 934, 938, 939 coding schemes, 947, 949, 950 cognitive decision theory, 475 cognitive utility, 475, 484 coherence, 274, 303, 464 coherent distributions, 1016 Coletti, G., 443, 466 collider, 824 bias, 827 common cause principle, 108 communal intelligence, 958 comparisons, 1009 compatible, 989 competing hypotheses, 553 complement of any event, 55 complement rule, 56 complete code, 875 complete coding, 949 complete ignorance, 408 completing the code, 949 complex hypotheses, 560 complex theory, 986, 989, 990, 1008 complexity, 585, 635, 637, 866, 994, 988, 996 algorithmic, 645, 679, 680n, 681– 683, 706 -equivalent programs, 679 -finite programs, 677–678 infinite, 894 Kolmogorov, 647, 648, 673n, 675– 688 oscillation, 683 plain algorithmic (C), 677, 679– 681, 683–685, 690, 696
invariance theorem for, 679 optimal program, 679, 682, 685 universal, 679 prefix-free (K), 679, 683–687 invariance theorem for, 685 optimal function, 685–686, 691, 694 universal, 685 component risk, 1091 composite hypotheses, 523 composite likelihood, 519 compressibility, compressible, 646, 647, 677–680, 686–689 b-compressible, 1-compressible, 678, 680, 686 compression factor, 677, 678 compression programs, 687 decompression, 676 compression, 865, 901, 902, 957 computability, computable, 645, 648, 665–667, 672, 690 for strings, 667 real numbers, 653, 692, 698 relations, 694 computable hypotheses, 873 computably enumerable (c.e.), 665, 667–668, 682, 690, 694, 700 martingale, see martingales relative, 697 computably random, see randomness, computable computational techniques, 985 computer simulation, 997 Comte, A., 1153, 1154 conclusion conjunction, 310 conditional dependence, 1003 conditional distribution, 1027, 1043, 1058 sampling, 1044 conditional empirical complexities, 1004 conditional excluded middle, 128 conditional expectation, 105 conditional frequentist, 564 conditional independence, 105
conditional likelihood, 519 conditional measures of uncertainty, 266 conditional probability, 28, 29, 57– 59, 99, 143, 240, 967 as a random variable, 115 as primitive, 118 axioms, 338 conditionality, 819 conditionalization, 104, 121, 130, 236, 243, 252, 255, 987 conditioning, 563, 826 conditions with probability zero, 111 Condorcet, M. J. A. N. de, 1149, 1151n, 1153, 1154, 1164, 1171 confidence interval(s), 1027, 1035, 1037– 1039, 1041, 1043, 1055, 1155 estimation, 82 procedure, 86 vs. severity reasoning, 181 confidence regions, 304 confidence set, 1092–1094 confirmation, 11, 109, 474, 483, 484, 517, 984 confirmation theory, Bayesian, 333– 387 confirmational inference, 474, 475, 484 conflict, 313 confounders, 825 confounding, 817, 824 path, 825, 826 conglomerability, 125, 143 conglomerative property, 468 conjecture, 945 Connor, R. J., 478 consequence, 269 conservation law, 996 conservation theory, 1003 consistency, 292 maintenance procedure, 319 constrained sample space, 57 constrained theory, 993 constraint graph, 326 context-free grammar, 866
1203 contingency tables, 481 continuity correction, 1035, 1038 continuous random variable, 72, 75 continuum of inductive methods, 457n controlled experiments, 998 conventional hypothesis testing, 299 convergence and supertasks, 1185 Bayesian, 336, 357, 362, 372–373 in distribution, 1029, 1058 in probability, 1013, 1029, 1058 in the limit, 985, 1021 to the truth, 1000, 1003n, 1004, 1005 in the limit, 986 convex, 1082 Copeland-Erd¨ os number, 658 Copernicus, 983, 984 core second-order EP, 318 correlation coefficients, 206 corroboration, 483 Costantini, D., 480, 481n countable additivity, 340–341, 412 countable admissible set, 461 countable infinity of outcomes, 412 counterfactual(s), 814, 828, 986 prediction, 999 Cournot, A. A., 43, 1154–1156, 1158, 1159, 1163, 1165, 1171 Cours de philosophie positive, 1153 Cousin, V., 1154 cover, 313 coverage, 1037, 1039, 1043, 1055, 1056 Cox, R. T., 397 creativity, 967 credal net(s), 325, 328, 329 credal network(s), 319 credal probability, 249 credence, 143, 144, 146 credibility, 1150 credible intervals, 1052, 1056 credible region(s), 292, 297, 304 criterion of evidence, 715 critical reasoning, 64
critical region, 257 critical value, 300 Critique of Pure Reason, 1158, 1166 croix vs. pile, 1152 cross classified populations, 486, 487 cross-validation, 996, 1108 cubic regression splines, 1133 cumulant, 1032, 1053–1055, 1061 generating function, 1032, 1060, 1061, 1065 cumulative distribution function, 1028, 1031, 1035, 1036, 1041, 1043, 1044, 1046, 1052–1055, 1057, 1058, 1066 cumulative hierarchy, 461 curve fitting, 17, 19, 21, 110, 486 curve of best fit, 423 cut-points, 969 cutoff subtraction, 1015 Czech book theorem, 398n d’Alembert, J. le Rond, 43, 1152, 1166 d’Amador, R., 1154 Darwin, C., 930, 931 Dasgupta, A., 34, 35, 38, 45 Daston, L. J., 1151n data, 493, 901 compression, 885 distribution, 64 dredging, 527 mining, 3, 41, 1099 database normalisation, 964 Dawid, A. P., 25, 449 de Finetti’s lottery, 412 de Finetti, B., 270, 335, 344, 1150, 1169–1171 De Morgan, A., 1160, 1161, 1164, 1166, 1171 decidability, 1001 decision, 233, 234, 517 graphs, 946, 964 problem, 268, 295, 298 rule, 997 theory, 110, 142, 268, 558
tree(s), 946, 947, 950, 964 deduction(s), 957 regarded as inferior to induction by Lokāyata, 1186 deductive argument, 93 consequence relation, 91 inference, 53, 473, 957 logic, 1001 validity, 53 defeated, 1012 defective code, 875 deferent, 983 definability, definable, 648, 662, 671n definition of empirical simplicity, 1002 degree(s) of belief, 143, 144, 146, 243, 245, 250, 475, 987, 989, 1016 degree of freedom, 208 degree of order, 485 degrees of falsifiability, 1002n degrees of testability, 856, 859 Delampady, M., 597 delta method, 1027, 1032, 1033, 1062 Dempster's rule of combination, 432 density, 937 Descartes, R., 43, 392, 1157, 1170 idiosyncratic reaction to Indian calculus, 1184 cogito, 989 description algorithmic or effective, 646, 671n, 676–678, 681, 687–688 complexity, 676 prefix-free, 683–684, 686 descriptive data analysis, 67 descriptive set theory, 667 descriptive statistics, 70 desiderata, 516 desirability, 110 Destutt de Tracy, A. L. C., 1154 detection, 1131, 1139 deterministic, 996 questions, 1007 Diaconis, P., 122
diagnostic test, 497 diagnosticity, 106 DIC, 594 dice how was the game played in early India, 1180 hymn in Ṛgveda, 1180 in early India and martingale bets, 1180 in Mahabharata epic, 1181 science of, related to sampling in Mahabharata, 1182 story of Nala and Ṛtuparṇa, 1182 Dickson, M., 4, 5, 45 Die Zeit Constantins des Grossen, 1168n difference between probability and statistics, 62 difference set, 313 digamma function, 297 dilution, 623 direct inference, 309 likelihood, 339, 348–349, 359 principle, 988 direct instruction(s), 957 direct standardization, 821 directed acyclic graph, 822 directed paths, 824 directional angular data, 964 Dirichlet distribution, 477–479, 481, 485 discrete random variable, 72 discriminative learning, 963 disjunctions of theories, 1022 disjunctive beliefs, 1014 disposition, 139, 140 distance from knowledge, 1009 distance from the truth, 484 divergence, 516, 519n divination, 633 DNA string alignment, 955 doctrine of chances, 1150, 1151 domination, 1008 Donkin, W. F., 1162, 1171 doomsday paradox, 726
1205 double sense probability has according to Cournot, 1156 Dowe, D. L., 22, 46, 110 dual additive measure, 410 Dubbins, L., 117, 308 Duhem, P., 4, 223, 346 Duhemian problems, 161 Dupin, C., 1153 Dutch Book, 5 argument(s), 102, 246, 398, 990 diachronic, 121, 123 E. Coli, 1096 Earman, J., 335, 373–374 Easwaran, K., 29, 45, 120 Eddington, A. S., 583, 586 Edgeworth expansion, 1053–1055, 1065, 1066 Edwards, A. W. F., 361 Eells, E., 108, 366 effective computability, see computability description, see description ergodic map, 703–705 full-measure, 673–674, 706 measure-zero, 673–674, 697–698 relative version, 693–694 open, 667–668, 673n, 692–693 relations, 694–695 relative version, 693 uniformly, 667–668, 673–675, 693, 697, 698 pursuit of the truth, 1000, 1021 specifiability, 646–647, 671n, 672 topological notions, 667–668 effects, 813 efficiency, 935, 1005 efficient convergence, 1005 efficient markets, 967 efficient pursuit curve, 1017 Einstein’s theory of gravitation, 586 Einstein, A., 583, 586, 587, 930, 931 elapsed time, 1010
elimination of nuisance parameters, 277 Ellis, R. L., 43, 1159, 1160, 1162, 1163, 1171 Ellsberg’s urns, 404 elimination of dominated alternatives, 994 Elo system, 964 elusive model paradox, 960, 961, 963 empirical adequacy, 1003n Bayes, 593, 1080, 1086, 1087, 1092 complexity, 1003, 1004 complexity set, 1005 effects, 1002 estimate, 991 laws, 984 method, 1004 presupposition, 1002 risk minimization (ERM), 851, 858 strategy, 1004 support, 483 theory, 1002 world, 1004 endpoints, 504 ensemble, 1095 risk, 1091 ENSO, 1127–1129, 1136, 1137, 1139 entailment, 753, 755, 771, 1018 non-ampliative, 760 entropy, 285, 875, 909, 921, 923, 966, 969 Entscheidungsproblem, 939, 960, 963 enumerative induction, 391n EP-calibrated objective Bayesianism, 327 EP-OBE, 328 epicycle, 983 epistemic consideration, 1002 goal, 250 justification, 249, 251, 254 motivation, 1009
1206 probability, 475, 476, 1150 calibrated objective Bayesian epistemology (EP-OBE), 328 reasons, 246–248 taste, 256 utility, 248 Epstein, L. D., 481 equidistribution, 660–661, 704–705 Weyl’s theorem, 645n, 660, 704 equipossible cases, 1150, 1151 equivocation, 322, 324 equivocator, 324, 325 ergodic, 660–661, 703–705 ERM, 853 error(s), 560, 757, 759, 770, 987, 1021 probabilities, 4, 155, 518, 519, 521 pre-data vs. post-data, 167 type I and II, 167 vs. posterior probabilities, 179 rate, 556, 1125 of hypothesis testing, 498 statistical philosophy, 154, 160 statistical test, 165 statistics, 2–4, 11, 153 and background knowledge, 159 and objectivity, 158 and the error statistical philosophy, 153 is self-correcting, 189 mistakes in inference, 160 philosophy for, 165 ESP, 625 Essai sur les fondements de nos connaissances, 1158 estimated correlation, 997 estimated distance from the truth, 484 estimated verisimilitude, 484 estimation, 64, 82, 91, 265, 294, 536, 752, 756, 759, 761, 763, 764, 766, 886 of a proportion, 266 problem, 82 estimator(s), 536
(statistic), 83 Euclidean distance, 1016, 1017 evaluation, 241 event(s), 55, 237, 266 eventual informativeness, 1005 evidence, 6, 11, 12, 14, 16, 18, 25, 252, 515, 516, 987, 1138, 1140 evidence function, 14, 15, 516, 524, 571 evidential decision theory, 142 evidential framework, 495 evidential probability, 307, 329 second-order, 307 evidential quantity, 494 evidentialism, 14, 16, 527 exchangeability, 270, 271, 457, 458, 762, 763n, 768, 1040, 1059 exchangeable, 480 events, 456 multinomial method, 477, 478 expansions of belief, 1010 expectation, 110 expected distance from the truth, 485 expected loss, 269 expected retraction times, 1014 expected squared error, 536 expected utility, 233, 240, 244, 269 expected value, 73, 74, 105 expected verisimilitude, 485, 486 experiment, 55, 79 experimental data, 991 experimental design, 964 experimental science, 814 expert probability, 107 explanation, 903, 912, 930, 931, 956, 966, 984, 1021 Exposition de la theorie des chances et des probabilit´es, 1155, 1158 expression, 1019 extra principle, 984 extraneous variation, 817 Fabius, J., 478 factor analysis, 965
Fagin, R., 455 fair coin model, 646, 652–656 fairness, 633, 638 faithfulness, 109 fallacy of acceptance, 176 fallacy of rejection, 168 arising from a large sample size, 174 False Discovery Rate, 1107 false oracle, 946 falsifiability, 855 falsification, 849 Fisher, 158 Popper, 158 falsificationism, 221, 858, 859 falsificationist, 517 approach, 515 family of normal distribution, 77 family-wise probability, 505 Fan, J., 602 Feinberg, S. E., 481 Ferguson, T., 483 Fermat, P. de, 634 Fermi-Dirac, 452n Festa, R., 8, 9, 45, 473, 475, 478, 479n, 480, 481, 483, 485–487 fictionalism, 968 fiducial probability, 760n Fieller-Creasy problem, 32, 737, 738, 745, 746 fine tuning, 958 Fine, K., 457n Finetti, B. de, 1, 6n, 8, 8n, 10, 99, 102, 112, 120, 126, 129, 398, 441–444, 456, 457, 459, 460n, 464–469, 475, 477, 752, 763n, 768 finite additivity, 412 finite Gaussian mixture models, 945 finite mixture model, 935 finite parameter space, 272 finite population central limit theorem, 1041
finite state automata, 955 first part, 939 first problem of the priors, 408 first-order approximations, 281 first-order effect, 1002 Fisher information matrix, 282, 289, 880, 942, 943, 953 Fisher, R. A., 1, 4, 90, 206–210, 212, 214, 221, 347, 361, 475, 634, 752, 755, 759, 760, 763, 766, 772, 818, 819, 1170, 1171 Fitelson, B., 335, 357, 483 five types of questions in statistics, 63 fixed bias, 1000 fixed-level testing, 89 flat priors, 291 Forbes, J. D., 1161, 1162 Formal Logic, 1160, 1161n Forster, M. R., 17, 18, 22, 45, 110, 535n, 545, 584, 589, 598, 599 foundations, 265 of decision theory, 295 Fourier coefficients, 591, 592 Fourier expansion, 591 Fr¨ommichen, K. H., 1158n free logic, 469 free parameter(s), 991, 994, 999 Frege, G., 46, 459 frequency, 1150 interpretation, 634 of probability, 73 limiting, 648, 656, 657, 660, 670n, 669–673, 698, 699, 701–704, 706 relative, 655, 656, 669, 672 view, 635–637 frequentism, 106, 541, 556 frequentist(s), 11, 263, 510, 536, 554, 556, 557, 563, 564, 577, 635, 642, 646, 668, 669, 670n, 697, 699, 703, 704, 706, 707, 1134, 1135 calibration, 267
1208 confidence intervals, 294 coverage probability, 292 inference, 493n interpretation of probability, 57 properties of Bayesian reference procedures, 292 statistics, 154, 475, 895 Neyman-Pearson vs. Fisher, 156 Bayesian discrepancy, 738 Freudian psychology, 855 Fries, J. F., 1159, 1171 FSMML, 940 fully consistent disjunctions, 1018 funnel, 994 fuzzy disjunction, 1018 fuzzy propositions, 1019 G¨odel, K., 46, 642, 663 G¨odel number(ing), 459n, 461, 665– 666, 679 G¨odel’s incompleteness theorem, 648, 682–683 Gaifman’s condition, 322, 341 Gaifman, H., 107, 341, 373, 460 Galileo, 595, 596, 634, 1150 GAM, 1131 gambling, 238 strategy, see betting strategy system, 654–656, 670, 692, 699 games, 633–635 gaming, 634 Gamma density, 280 Gan.ita S¯ ara Sam . graha, 1178 gap or no gap, 932, 941 gappy, 932, 941, 947, 957 Gardner, M., 587 Garibaldi, U., 481n Gaussian, 935 competition(s), 920, 922 distribution, 70 mixture modelling, 965 Gauvreau, K., 13 GC-method, 478–481, 485
gene, 1095, 1096 generality, 304 generalization Carnapian method, see GC-method cross-validation, 1133 of conditionalisation, 253 exponential family, 275 probability conjunction rule, 60 generative learning, 963 geometric, 1088 distribution, 1037 heuristic, 1081, 1082 random variable, 1037 geometry, 1018 Gettier, E., 1014, 1015, 1021, 1022 Ghosh, J. K., 24, 25, 45, 46n, 583, 588, 592–594, 597, 599 Gibbs sampler, 305 Gillies, D., 449 global reliability, 521 globalistic approach, 477 Glymour’s “bootstrap”, 394 Glymour, C., 366 God, 991 Godambe, V. P., 44 G¨odel’s incompletness theorem, 960 Goldstein, M., 107 Good, I. J., 478, 481, 483, 483n, 1166 Goodman’s new riddle of induction, 859n Goodman, N., 364, 857n, 859n goodness of fit, 17, 866, 901 Gossett, W., 205–207 Gouraud, C., 1154 Gr¨ unwald, P., 21–23, 46, 110 grand mean, 1077, 1080, 1084 graphical analyses, 65 graphical model(s), 909, 928, 947, 963 graphical summaries, 65 gravitational field, 267, 278 great human intellects, 930 Greaves, H., 121 Greenland, S., 25, 26, 28, 46 Greenwood, M., 43
1209 Grossman, J., 18, 19, 45 group invariance, 1166n grue, 2, 954, 955, 958 guarantee worst-case, 870 Gupta, A. K., 481 H´ ajek, A., 28, 29, 45, 111n, 112–114, 118, 130, 131, 339, 343, 344 Hacking, I., 44, 107, 361, 537, 1170, 1178 Halayudha, 1175 Haldane, J. B. S., 1191 Hall, N., 130, 131 Halpern, J. Y., 459n halting probability, 691–692 relative, 694 problem, HALT, 666–667, 939, 960, 961, 963, 967 happiness-conducive, 1002 hard choice, 1012 Harman, G., 20, 46 Harper, J., 43, 46 Hartley, D., 1151n Hausman, D., 109 Hawthorne, J., 5, 6, 8n, 11, 14, 45, 366, 368, 370, 376, 378, 381– 383 HC, 461 heavy-tailed distribution, 79 Hegel, 1158 Hegelianism, 576 Hellinger distance, 593 Hempel’s “satisfaction” criterion, 394, 392 Hempel, C., 92, 109, 392 Herbrand, J., 663 hereditarily countable sets, see HC hereditarily finite sets, see HF Herschel, J. F. W., 1160–1162, 1171 Hesse, M., 481 HF, 461 hierarchical classification, 947
Higgs boson, 966, 969 high variance, 79 Hilbert’s program, 663 Hild, M., 121 Hildebrand, D. K., 487 histograms, 66 history of science, 983 history of the normal distribution, 76 Hjort, N. L., 603 Hoeting, J., 603 Holm-Bonferroni method, 1107 Howson, C., 8, 10, 45, 335, 368, 450, 467n, 536 HPD regions, 297 Huffman code, 903, 913, 938 construction, 903, 905 Hume, D., 2, 43, 392, 814, 1151n, 1154, 1158 humour, 967 Humphreys’ paradox, 139, 140, 142 Humphreys, P., 138–141 Huygens, C., 441, 634 hybrid Bayesian network, 947, 963 hyperarithmetical (∆11 ), 463n hyperarithmetical hierarchy, 696 hypothesis, 199, 201n, 203, 204, 206, 207, 210, 214, 217, 220, 223, 517, 751, 755, 757–761, 763– 772, 865, 901 testing, 64, 87, 90, 265, 294, 298, 499, 505, 738, 919, 964, 1106 hypothetical induction, 394 hypothetico-deductive framework, 514 hypothetico-deductive reasoning, 225 I1D , 940 ideal agents, 254 ideal codes, 876 Ideal Group (IG), 940, 941 ideal MDL, 872 idealizing condition, 486 IG, 940, 941 ignorability, 820 strong, 820
1210 weak, 821 ignorance prior, 411 IIP, 963 image recognition, 947 importance sampling, 304 imprecise hypothesis testing problem, 742 improper, 117 distribution, 450 prior, 607, 611, 614, 618 function, 274 inaccessible truths, 1003n inclusion bias, 827 incompressibility, incompressible, 38, 646–647, 681, 686, 687, 689, 705–707 b-incompressible, 680, 681, 686– 689 incremental measure of confirmation, 483 indefeasibility accounts of knowledge, 1001 independence, 104, 143 independent and identically distributed (i.i.d.), 1109 independently confimable principles, 984 indexical beliefs, 121 indifference, 989 indirect inference, 309 induction, 1, 903, 912, 930, 931, 956 inductive acceptance, 484, 485 inductive argument(s), 55, 200 inductive disjunctive fallacy, 410 inductive generalisation, 394 inductive inference, 53, 91, 93, 473, 475, 896, 901, 930, 931, 956, 985, 1099 inductive inquiry, 1008 inductive logic, 8, 9, 473, 474, 484, 487, 751–754, 756, 769, 772, 1001 ampliative, 755 Bayesian, 769, 771
Carnapian, 761, 762, 764, 767– 769 non-ampliative, 763 programming, 963 inductive probabilities, 473 inductive rules, 480 inductive-statistical explanation, 109 inference, 62, 70, 751, 752, 754, 755, 759, 771, 813, 903, 912, 930, 931, 956, 962, 966 ampliative, 758n Bayesian, 764, 769, 770 inductive, 761, 763 statistical, 772 inference about uncertainty, 92 inference on normal parameters, 293, 298 inference procedure, 557, 558, 564 inference summary, 265, 294 inferential analysis, 67 inferential rational in testing., 158 infinitary language, 460 infintesimal proability, 111n information, 263, 572 complexity, 676, 682, 683, 689, 706 content, 675, 680, 680n, 681, 692 criteria, 523, 524 theory, 284, 648n, 680n, 682, 869, 902, 903, 936 Shannon’s (entropy), 680n theoretic, 264, 296, 902, 913, 936 informationless priors, 8 instantaneous code, 907 integral representations, 271 intelligence, 901, 930, 956–958 testing, 201, 218 interpretations of probability, 55, 106, 264 intersection, 56 interval(s), 64 estimation, 87, 292, 297, 485 scale, 65 of values, 404
intervening vs. conditioning, 140 intervention(s), 987, 999 intrinsic classification, 965 credible regions, 298 discrepancy, 296, 300 estimation, 296 estimator, 287, 296 hypothesis testing, 299, 301 limit, 285 invariance, 286, 296, 297, 409, 932, 953 invariant minEKL, 954 invariant prior(s), 618, 953, 954 invariantised density, 609, 612, 615 inverse-probability weighted, 821 James-Stein estimation, 947, 1077, 1079, 1080, 1086, 1087, 1091, 1093 Jaynes, E. T., 126, 325, 335, 397, 446, 450, 451 Jeffrey, R. C., 110, 335, 344, 367 Jeffrey conditionalization, 122, 123, 130 Jeffrey's rule, 432 Jeffrey's measure, 607, 609, 612, 613, 615, 618, 619, 621–624, 626 conditional, 618 marginal, 618 Jeffreys, H., 99, 118, 335, 475, 1150, 1166, 1170, 1171 Jeffreys' prior, 286, 949, 954 Johnson, W. E., 99 joint probability mass function, 75 Joyce, J., 335, 344, 593 Judy Benjamin problem, 125 justification, 985, 987, 1009 of Ockham's razor, 1021 justifying Bayesian updating, 990 K-triviality, 697 Kadane, J. B., 117, 126 Kahneman, D., 433 Kant, I., 43, 1158, 1166, 1167 Kelly, K., 23–25, 46, 858–860
kernel, 274 Keynes, J. M., 99, 118, 308n, 335, 364, 403, 451, 452, 454, 459, 1150, 1159, 1167, 1171 king of Siam, 1165 KL distance, see Kullback Leiber (KL) distance KL divergence, see Kullback Leiber (KL) divergence Kleene, S., 663, 673n knowledge discovery, 1099 Knuth, D., 645n, 661 Kolmogorov complexity, 21, 36, see complexity, 866, 873, 896, 902, 910, 912, 938, 939, 946– 948, 955 Kolmogorov’s axiomatization, 100, 118 Kolmogorov, A. N., 28, 29, 36, 99, 111, 115, 117, 120, 132, 457, 463, 635, 645n, 647, 656, 668, 669, 673n, 675, 680n, 896, 909, 912 Koltchinski, V., 603 Kothari, D. S., 1191 Kraft’s inequality, 691, 875, 907 Krauss, P., 341, 460, 461, 463 Kriy¯ akramkar¯ı, 1176 Kuhn, T., 447, 814 Kuipers, T. A. F., 478, 480 Kulkarni, S., 20, 46 Kullback-Leibler, 265, 909, 914, 922– 924, 927, 931, 933–935, 941, 952–954 Kullback-Leibler discrepency, 1132 Kullback-Leibler distance, 17, 18, 611, 924 Kullback Leibler (KL) divergence, 22, 518n, 519, 524, 592, 593, 599, 924, 925, 1017, 1018 599 kurtosis, 1046, 1061 Kyburg, H. E., 8, 92, 307, 309, 315, 317, 322, 349, 442, 443, 469 L(ω1CK ), 462, 463n
La Harpe, J.-F., 1154 Laing, J. D., 487 Lambalgen, M. van, 648, 669, 692, 694, 696, 698 Lange, M., 123, 370 language, 955, 958 Laplace central limit theorem, 1028–1030, 1032, 1035, 1038, 1041 Laplace method, 1064 Laplace, P. S., 43, 446, 455, 634, 644, 654, 675, 676, 681, 687, 688, 1149, 1151n, 1153–1155, 1160, 1161, 1163, 1168, 1171 large p, small n, 1095 Laurent, H. H., 1164 law of errors, 200 frequencies, 703–704 iterated logarithms, 659–660, 673, 691 large number(s), 373, 739, 1029, 1063, see Borel's strong law and fear of small numbers, 1185 law of likelihood, 15, 18, 357–361, 496, 501, 517–519, 537, 540, 541, 574 of probability, 240, 248 of nature, 955 randomness, 656, 661, 673 symmetric oscillations, 659, 672, 703 law of total probability, 59, 103 learning, 901 process, 273 theoretic paradigm, 399 least square(s), 1073, 1088 difference, 486 Lebesgue measure, 648–653, 656, 692, 703, 705n Leblanc, H., 118 left classical equivalence, 310 Lehmann, E. L., 583 Leibniz, G., 43, 1150, 1152, 1158 Lele, S., 14–16, 45
Letters on Probability, 1162 level of confidence, 86, 87 Levi, I., 113, 121, 123, 308n, 335, 349, 475 Levin, L. A., 648n, 683 Lewis, D., 121, 127–130, 335, 348, 400, 465, 1169 Li, K. C., 584, 602 likelihood, 103, 333–336, 339, 345, 347– 387, 901, 1046, 1050, 1051, 1066, 1131 based statistics, 3 framework, 11, 14 function, 11, 274, 497, 517, 561, 569, 571–578, sub 1029, 1048, 1051, 1056, 1064, 1067 interval, 540 paradigm, 496, 505 principle, 3, 18, 183, 185, 274, 360–361, 497, 518n, 519, 553, 554, 557, 561–563, see LP, 564, 566–569, 571, 573–578, 771, 966 weak vs. strong, 185 ratio, 355–358, 361–363, 365, 369, 371–375, 377–387, 496, 497, 505, 517–520, 522, 523, 525, 528, 574 Convergence Theorem, 357, 362– 363, 369, 371–384, 387 theory of evidence, 421 likelihoodism/ists, 2, 11, 45, 357–361, 371, 421 L¯ıl¯ avat¯ı, 1176 limited information, 291 limited translation estimator, 1091 limiting frequency, see frequency Lindeberg-Feller theorem, 1031, 1067 Lindenbaum algebra, 460 Lindley’s paradox, 32, 34, 302, 737, 738, 741–744 Lindley, D. V., 481 linear, 1131 causal network, 1003
1213 combination, 79 model, 1043 regression, 593 transformation, 77 trend, 1125, 1140 linguistic propositions, 515 linguistic theory, 516 LNPP-space, 936 LNPPP, 960 local reliability, 521, 522, 527, 528 location model, 291 Locke, J., 1151n log-likelihood, 584, 901, 931 log-loss, 916, 952 compression-based competition, 922 scoring, 914, 928 logarithmic divergence, 265, 281, 284 logarithmic loss, 889, 916 logic 3-valued, 1192 and sy¯ adav¯ ada, 1192 Buddhist, 1191 Buddhist logic as quasi truthfunctional, 1193 Buddhist logic leads to quantum probabilities, 1194 catus.kot.i, 1193 Jain, 1191 Jain logic as 3-valued, 1192 not culturally universal, 1194 of 4-alternatives and Schr¨ odinger’s cat, 1193 of 4-alternatives as quasi truthfunctional, 1193 pre-Buddhist, 1192 quasi truth-functional, 1191 quasi truth-functional and quantum mechanics, 1191 Logic of Chance, 1159, 1160, 1162, 1163 logical consistency, 1010 logical interpretation, 106 logical probability, 144–146, 1150
logical probability function, 456 logical retractions, 1020 logically independent, 61 logistic regression, 1049 logit function, 1049 long run, 990 convergence, 990, 997n loss, 1076, 1079, 1080, 1082, 1083, 1086, 1092, 1093, 1095 entropy, 1095 function, 10, 295, 398n, 560 lot, 633 lottery, 317, 495, 633, 636 paradox, 204, 722, 1108 lowest posterior loss, 298 LP, 567, 569, 572 luckiness, 869, 887 function, 897 principle, 871, 872, 877 Lugosi, G., 603 MG , 521–523, 526 ML , 521, 523, 525 Martin-L¨of randomness, see randomness, Martin-L¨ofMartin-L¨of, P., 645n, 646, 653, 673, 673n, 674, 675, 689, 696 Martin-L¨of-Chaitin thesis, 649, 690, 705–708 m† , 454 M´er´e, C. de, 634 machine learning, 901, 1021, 1099 MacQueen, J. B., 483 magnetic field, 934 Mahabharata, 1181 notion of fair gambling in, 1181 relation of dice to sampling theory, 1182 Mahalanobis, P. C., 1191 Maher, P., 123, 335, 481, 485 Makinson, D., 443n Mallow’s statistic, 996 manipulation(s), 813, 814 MAP, 939, 941
1214 margin of error, 1093, 1094 marginal effects, 820 likelihood, 525, 607–609, 618, 621 mass functions, 75 probability, 57, 58, 901, 937, 953 marginalization, 288, 290 marginalization paradox, 622 Markov chain(s), 874 Monte Carlo (MCMC), 304, 940 Markov condition, 319 Markov decomposition, 823 Markov model, 962 marksman, 991, 992 Martin-L¨ of, P., 36 Martin-L¨ of-Chaitin thesis, 38 martingales, 38, 673, 697–703, 706 computable, 700–701 computably enumerable, 700, 707 fairness condition, 698, 699 partial computable, 700–701 Marxism, 855 material theory of induction, 400 material vs. formal view of logic, 1163 Maximum A Posteriori (MAP), 934, 939, 941 maximum entropy, 285, 411 maximum likelihood, 574, 575, 584, 752, 755, 759–761, 764, 766, 932–934, 941, 947, 952, 953, 964, 1073–1077, 1079, 1080, 1082–1084, 1087, 1088, 1090– 1092, 1096, 1134, 1135 estimation, 9, 281, 423, 557, 772, 991, 993, 1048, 1051 maximum posterior expected entropy, 1166n Mayo’s test severity, 521 Mayo, D., 4, 11, 16, 19, 45 McCarthy, D., 127 McGee, V., 119, 132 McMichael, A. F., 483 MDL, 895, 944, 948–950, 964 mean, 65, 66, 1123
squared error (MSE), 992 measurable, 650–652, 657, 703–704 bijection, 653 measure(s) of central tendency, 65 of dispersion, 68 of evidence, 493 -zero, 650, 651, 651n, 654 full-, 653, 654, 696 Lebesgue, see Lebesgue measure preserving, 703–704 measurement error, 515, 828 measurement of a physical constant, 267 median, 65, 66, 1123 medieval scholastic terminology, 1169 Mellor, H., 1150 memory, 957 Mendel, G., 208–210, 212 message length, 958 meta Jeffreys measure, 613, 615, 621, 626 metaphysical argument, 989 method, 985 methodological principles, 1004 Metropolis algorithm, 305 Michell, J., 1161, 1164 microarray, 1095 Mill's ratio, 1064 Mill, J. S., 43, 392, 1159, 1160, 1163, 1164, 1171 minimal sufficient statistic, 1043 minEKL, 932, 933, 940, 941, 953, 957 minimal cover under difference, 313 minimal sufficient statistic, 1060 minimally sufficient test statistic, 259 minimax, 996, 1008 minimax optimal rate, 887 minimax regret, 996 minimizing MSE, 996 minimum description length (MDL), 21, 865, 902, 928, 940, 944, 948, 949, 996 inference, 110
1215 minimum message length (MML), 21, 896, 901, 928, 953, 956, 962 inference, 110 minimum sufficient statistic, 939 miracle, 967 Mis-Specification (M-S) testing, 190 a parametric test of independence, 192 and the use of the residuals, 194 the runs test, 191 vs. Neyman-Pearson (N-P) testing, 190 Mises, R. von, 35, 36, 634, 635, 642, 645n, 646, 648, 654, 668– 670, 670n, 671, 671n, 672, 699, 703–707, 1150 and randomness, see randomness, von Misesmisleading evidence, 13, 494, 503, 504, 518, 520, 522, 525, 526 missing data, 820 missing information, 285 missspecification, 935 mixed strategies, 1013 mixture modelling, 928, 965 ML plug-in code, 883 MLE, 993 MML, 929, 930, 933–936, 940, 941, 944–948, 950, 951, 953–956, 958, 962–964, 966–968 subitem estimator, 932 subitem mixture modelling, 947 subitem -like, 949 MMLD, 940 mnemonic, 968 mode, 65 model9s), 17, 514, 560, 869, 874, 991 averaging, 962 building, 813 complexity, 594 misspecification, 934, 944, 945, 949 order selection, 949 robustness, 518
selection, 3, 24, 25, 524, 874, 876, 887, 997 consistency, 888 vs. statistical model specification, 195 selection bias, 527 specification, 813 switching, 897 validation, 189 modus ponens, 957 Moivre, A. de, 634 moment generating function, 1090 monkey, 1101 monotonicity, 53, 54, 1001 monotonic reasoning, 55, 91, 93 Montague, R., 459n Monte Carlo resampling, 1107 Montmort, P. R. de, 441 Monty Hall problem, 30, 32, 123, 720, 721 Moore’s paradox, 107 Moore, J., 43, 46 Morgan, A. de, 43 Mosimann, J. E., 478 motif de croire, 1154 MSE, 993, 994, 996, 997 muddy Venn diagram, 416 Mukhopandhyay, N. D., 593 multicategorical experiment, 480, 481 multinomial inference, 476, 477 multinomial method, 476, 479–481, 485 multinomial process, 476, 479–481 multiple comparisons, 493, 495, 504, 526 multiple decision makers, 561 multiple hypothesis testing, 1106 multiple looks, 493, 495 multiple testing, 1125 multiplication, 80, 415 rule, 59, 60 rule of probabilities, 466 multiplicity, 524 multiply robust, 822
1216 multivariate central limit theorem, 1030, 1031, 1063 Mura, A., 465 mutual independence, 60, 61 mutually exclusive events, 56, 61, 72 mutually independent events, 80 n-randomness, see randomness narrowness, 415 nat, 907, 950 natural ban, 907 negative binomial, 271, 1036, 1038 distribution, 1034, 1037, 1038 random variable, 1038 nested model, 584, 588 Neumann, J. von, 643 neural nets, 964 neutrality, 478 Newcomb’s data, 84 Newcomb’s measurement, 82, 83 Newcomb’s problem, 32, 729, 730 Newton, I., 930, 1002, 1008, 1194 Neyman vs. Carnap, 154 Neyman, J., 1, 90, 214, 221, 475, 634, 1171 Neyman-Pearson, 523 Neyman-Pearson (N-P) test, 9, 167, 752, 755–759, 763, 769, 772 Neyman-Pearson test sizes, 521 Neyman-Scott, 932, 933, 941, 947, 957, 964 panel data, 947 Nicod’s principle, 254, 255 Niiniluoto, I., 473, 475, 480, 481, 484n, 485, 486 nit, 907, 950 NML, 948, 949, 954 no short path(s) assumption, 1011, 1015n noise, 902, 1103 noiseless coding theorem, 875 nominal, 64 scale, 65 non-additive measures, 407
non-Bayesian shifts, 417 non-circular epistemic argument, 1008 non-conglomerability, 125, 126 non-experimental data, 1000 non-human intelligence, 958 non-measurable sets, 114 non-metric scale, 65 non-monotonic, 1001 non-universality probability, 913 non-identification, 814, 815, 828 non-ignorability, 817 non-informative prior, 283 non-informative priors, 275, 411 non-linear, 1131, 1138 trend, 1140 non-monotonic logic(s), 311, 326, 329 non-monotonic reasoning, 54, 91, 93, 1111 non-parametric regression, 591, 592 non-randomized studies, 820 non-rejection, 257, 258 normal, 289 approximations, 38 distribution, 70, 71, 76, 84, 200– 202, 203n, 206, sub 1073– 1075, 1081, 1087, 1088, 1090, 1092, 1095 Ockham method, 1005 Ockham strategies, 1008 parameters, 277 probability, 1091 normalised maximum likelihood, 880, 949 normative axioms, 239 norms of inference, 557 Norton, J. D., 7, 8, 11, 25, 45 nose, 993 notational brevity, 984 novel policies, 997 nuisance parameter(s), 277, 283, 518n, 525 null hypotheses, 87, 89, 201, 203, 207, 210–214, 216, 218, 220, 221, 223, 553, 816
in testing, 155 multinomial distribution, 1096 O(1), 913 objections to the over-fitting argument, 996 objective/objectivity, 263, 265, 284, 303, 988 chance, 991, 1169 constraints, 990 invariant prior, 954 priors, 953, 954 probability, 236, 988 objective Bayesian/ism, 9, 314, 557, 950, 953 epistemology, 322, 329 methods, 281 observational science, 813, 826, 828 Ockham efficiency theorem, 1007, 1014, 1016, 1018, 1021 Ockham violator, 1009 Ockham's razor, 24, 585, 858–860, 866, 891, 903, 928–930, 956, 983, 985–989, 991, 994, 996, 999, 1000, 1004, 1009, 1105 old evidence, 253 one standard deviation of the mean, 70 one-part message, 936 one-tailed significance test, 211 open sets, 649–651 effective, see effective operational meaning, 304 optimal efficiency, 987 optional stopping, 186 options, 236, 237 Oracle, 584, 592, 600, 601 Order 1, 913 order of magnitude, 1058, 1059 order statistic, 1043–1046, 1060 ordinal, 64 scale, 65 originality, 967 orthodox statistics, 475
outcome, 237, 245 over-aim, 993 over-fitting, 41, 868, 869, 993, 997, 1007, 1102 argument, 996, 1000 P-value, 90 p-value(s), 166, 213, 216, 222, 493, 495, 499, 500, 506, 517, 520–522, 739, 741, 742 vs. posterior probability, 179 P-atom, 129 Pagano, M., 13 pairwise independence, 61 panel data, 933, 941, 944 paradox of indifference, 989 paradox of the raven, 392 parameter, 264, 270, 271, 989 estimation, 759 parametric complexity, 880 parametric regression, 591 Pareto, 1009 partial assignment, 966 computable, 665–666, 668, 685, 686 computably random, see randomness, partial computable correlation, 1003 order, 402 particle physics, 966 Pascal, B., 441, 634 path-independent, 1020 paṭicca samuppāda, 1194 Pāṭīgaṇita, 1176, 1178, 1180 Pauler, D. K., 602 PCCP, 128, 129 conditional, 129 function, 129 PDO, 1127–1129, 1136, 1137, 1139 Pearl, J., 124 Pearson, E. S., 90, 206, 207, 214, 215, 221, 475, 634 Pearson, K., 202, 203, 205, 207, 208
1218 Peirce, C. S., 1163, 1171 penalized likelihood rule, 584 penalized log-likelihood, 585 pendula, 1002 per datum predictive accuracy, 545 perfect illusion, 1003n permutation central limit theorem, 1045 permutation sampling distribution, 1047 permutation test, 818, 1043–1045, 1060 permutation-invariant, 446 permutations and combinations and Indian music, 1175 and large numbers, 1176 continuous tradition from Vedas to Bhaskara II, 1176 Pi˜ ngala’s rule and binomial expansion, 1175 typical formula for combinations, 1177 Peter principle, 969 Phillips Information Criterion, 949 philosophical can of worms, 895 philosophy, 1021 of statistics, 1, 2, 46, 47 phrase structured grammars, 866 physical probability, 1150 Π11 , 463n PIC, 949 Pi˜ ngala, 1175 pivotal inference, 557 pivotal quantity, 1027, 1055 place selection (rule), 669–672, 698– 699, 701–706 admissible, 671–672, 706 computable, 672 partial computable, 672 plain algorithmic complexity, see complexity planatory power, 984 planetary astronomy, 983 Plato, 392 plausibilistic inference, 474, 475, 484 plausibility, 269, 474 Poincar´e, H., 1168, 1169, 1171
Poinsot, L., 1153 point estiamte, 949 point estimation, 82–84, 292, 295, 485 Poisson distribution, 965 Poisson, S.-D., 43, 271, 280, 441, 1081, 1149, 1153, 1155–1157, 1159, 1171 Poisson-Gamma mixture, 280 policy predictions, 987 political survey, 267 polynomial curve, 984 degree, 986, 1012 law, 1010 regression, 947, 964 theories, 1012 Popper dimension, 856, 857, 859 Popper function(s), 29, 119, 122 Popper, K., 20, 29, 99, 118, 221, 225, 432, 473, 483, 484, 849, 854– 856, 858–860, 1150 fallacy in his solution of problem of induction, 1185 theories of verisimilitude, 484 population distribution, 80 mean, 70, 80 standard deviation, 84 variance, 69, 80 Port Royal logic, 442 possible observations, 561 possible world, 968 post data, 522, 527 Post machine, 662–666, 681–683 instruction set, 663 programs, 663–666, 668, 676–677, 681–683 universal, 666 Post, E., 663 post-data reliability, 519 posterior cumulative distribution function, 1056 density, 1050, 1052, 1053, 1056,
1219 1057 distribution, 274, 275, 929, 1027, 1050–1053, 1056, 1057, 1064, 1067, 1086, 1087, 1169 see also posterior probability mean, 295, 953 mode, 295 model probabilities, 607, 609, 615 odds, 584, 612, 614 probability, 234, 252, 347, 351– 359, 361–364, 369, 371–373, 375, 379, 387, 477, 493n, 500, 572, 584, 765–768, 770, 771, 901, 933 see also posterior distribution quantile, 295 ratio, 988 potential outcome, 814, 815, 828 potential probability statement, 312 power, 526, 757, 759, 770 of a test, 167 practical certainties, 314 pragmatic considerations, 1002 instrumentalist, 997 motivation, 1009 reasons, 246 virtues, 986 pre-data, 522 precise fractions comparison of Indian and Roman numeration, 1179 in late Greek tradition and Ptolemy, 1179 precision, 401 prediction, 11, 17, 20, 28, 267, 280, 813, 814, 869, 903, 930, 931, 962, 966, 983, 993, 1000 logic, 487 predictive accuracy, 17, 18, 20, 535, 541, 913, 952, 986, 1021 distribution, 280, 615 distributions, 279
inaccuracy, 996 inference, 279 MDL, 881, 889 probability, 478–481 predictivistic approach, 477 preference(s), 236, 238, 245, 269 formation, 242 prefix code(s), 870, 874, 907 prefix-free coding, 683–684 complexity, see complexity set, 683–685, 691, 692 premise disjunction, 310 prequential forecasting system, 882, 889, 890 prequential model validation, 890 preserving content, 1020 prevalence, 273, 498 principal components, 1049, 1050 principal principle, 107, 138, 140, 144, 245, 348–349, 400 principle of direct probability, 107 principle of indifference, 112, 408, 458, 713–718, 720, 767, 1166n principle of parsimony, 22, 585, 866 principle of positive instantial relevance, 457n principle of total evidence, 772 prior bias, 987 toward simplicity, 986 prior degree of belief, 987 prior density, 1067 prior distribution, 275, 478, 479, 481, 577, 1051, 1086, 1087 prior odds, 613 prior probability, 234, 252, 335, 336, 347, 350–365, 370–373, 387, 477, 498, 558, 572, 576, 584, 760n, 764–769, 771, 901, 988, 990, 991, 994 distribution, 272 prior ratio, 988 prior-free decision rule, 994 priors constructed by some formal rule,
1220 411 prisoner’s dilemma, 734 probabilistic causation, 108 competition, 921 conclusion, 91–93 diagnosis, 266, 273, 275 independence, 60, 61, 204 indifferentism, 1019 inference, 91–93 logic, 319, 320 prediction, 915 prediction competition, 922 predictions, 921, 927 process, 199, 212 reasoning, 93 reasoning in ancient India, 44 retractions, 1020 scoring, 913 probabilistic/statistical reasoning, 54 probabilities of conditionals, 127, 967 probability, 43, 199, 201–204, 206– 217, 220, 263, 265, 266, 322 as long-run frequency, 739 assertions, 460 assessment, 291 axiom, 246 density function, 75, 1046, 1048, 1050, 1054, 1055, 1058, 1059, 1065, 1066 distribution, 71–73, 79, 269 failure of frequentist interpretation, 1183 frequentist interpretation and supertasks, 1185, 1186 function, 100, 322, 323 has a double sense according to Cournot, 1155 its role in inductive inference, 153 mass function, 73, 1035–1038, 1041, 1046, 1048, 1050, 1058, 1059 matching priors, 293 model, 57, 64, 71, 80, 263, 272 not ampliative, but estimates may
be, 1185 of single events and quantum mechanics, 1191 of single events and subjectivist interpretation, 1190 of observing misleading evidence, 501, 504 on lattice of subspaces of a Hilbert space, 1192 over a Boolean algebra of statements, 1191 paradoxes, 32 quantum interference, 1190 probability space, 100 /statistical reasoning, 53 theory, 264, 271 probability1 , 441 probability2 , 441 probable verisimilitude, 486 problem of induction, 2, 19, 46, 1004 of observational errors, 486 of old evidence, 6, 402 of subjectivity, 14 of the reference class, 312 solving, 1004 process variation, 515 product of normal means, 293 profile likelihood, 518, 519, 525, 525n promiscuous grammar, 868 propensity, 139–142, 146, 1154 interpretation, 106 scores, 821 proper likelihood, 525n property holds always, 1006 proportion of infected people, 287 propositions, 237 prosecutor’s fallacy, 179 protein folding, 965 pseudo-randomness, see randomness psychological probability, 1150, 1166 psychologism, 1159 Ptolemaic theory, 983 Ptolemy, 984
public decision making, 283, 303 pun, 967 pure (deterministic) strategy, 1013 pursuit of the truth, 1001, 1002, 1016 puzzle, 986, 1000 Q-predicate, 455 quadratic loss, 292 qualitative evidence, 323 qualitative theory, 486 qualitative variables, 65 quality assurance, 268, 280 quantitative evidence, 323, 324 quantity of interest, 283 quantum mechanics non-existence of joint probability distributions, 1191 probability interpretation of wavefunction, 1190 structured-time interpretation, 1191 Quetelet, A., 1149, 1162 Quine, W. V., 4, 223, 346 quiz shows, 920 Rényi, A., 466 Radon-Nikodym theorem, 116 Raftery, A., 603 Raju, C. K., 43, 44, 46 Ramsey test, 127 Ramsey, F. P., 127, 335, 344, 442, 459, 1150, 1171 random, 962 methods, 1014 number generators, 636 sample, 270, 271 sampling, 1109 strategies, 1021 variable(s), 71, 105 randomization, 817, 823, 826, 1039 randomized experiments, 818 randomness, 34–36 n-randomness, 662, 694–696 1-random, 662, 695, 696 2-random, 694–696 absolute, 662
algorithmic, 645–648, 648n, 662, 705, 706 arithmetical, 695 Church, 672, 701 computable, 700–702 existence of random strings, 680, 687 hyperarithmetical, 696 Kolmogorov-, 696 Kolmogorov-Chaitin, 688–690, 696, 706, 707 Kolmogorov-Loveland, 701–703, 707 Martin-L¨of, 646–648, 653, 673– 675, 689–690, 695, 697, 698, 700–703, 705, 705n, 706, 707 Mises-Wald-Church, 671–673, 690, 702, 704 of sequences and strings, 641–648, 653–662, 668–676, 680–681, 683, 686–708 partial computable, 700–702 pseudo-randomness, 642, 643, 661 random real number, 691 random walk, 658–659, 672 relative, 692–694 Schnorr, 697–698, 701, 702, 705n Solovay, 673–675, 689 stochastic, 653 laws for, 655–657, 659, 661, 672, 689, 706 von Mises, 646, 648, 668–673, 698, 703–707 Ratio, 1159n ratio, 64 analysis, 111, 132 formula, 100 rational reconstruction, 473, 483, 485, 486 rationality, 990 rationally update, 987 real values, 402, 405 realism, 20 Recherches sur la probabilit´e des juge-
1222 ments, 1153, 1154 recursively enumerable, 461 red herring, 962 reference analysis, 10, 264, 283, 284 class, 1104 class problem, 41 credible intervals, 294 distribution, 284 formula, 309 marginal density, 616, 624 marginal likelihood, 617, 621 measure, 615 model probabilities, 607, 616, 621 posterior, 283 posterior odds, 624, 626 prior(s), 283, 285, 286, 288, 289, 293, 303, 411, 615 and the LP, 186 referendum, 275, 276 refined MDL, 897 reflection principle, 107 refutability, 1001 refute and rescale, 415 refuted, 1012 regression, 816, 1088 analysis, 486 coefficients, 206 model, 623 regret, 869, 870, 996 worst-case, 870 regular, 111 regularity condition, 1027, 1048, 1050, 1054, 1056, 1064, 1066 Reichenbach, H., 108, 1150 rejection, 257, 258, 299 relative frequency, 55, 247, 266 relative frequency interpretation of probability, 397 relative frequentists, 217 relative randomness, 638 relative satisfiability, 445 relativity to evidence, 256 relativization, 693
relativized conditional probability, 144 relevant events, 268 relevant statistics, 314 reliabilism, 520 reliability, 499, 516, 521, 523 reliable indication, 985, 1000, 1021 Renyi, A., 99, 118 repeated examination of accumulating data, 502 representation theorem(s), 239, 270, 271, 303 representative sampling, 1110 response bias, 827 restricted posterior, 277 restricted problem, 1004 restricted reference priors, 291 retraction(s), 277, 298, 986, 987, 1001, 1021 complexity, 1001 degrees, 1014 efficient, 1007 times, 1009 minimization, 1002 retrograde motion, 983, 984 R . gveda hymn on dice, 1180 richness, 313, 314 rifle welded, 997 right weakening, 310 right/wrong errors, 952 right/wrong scoring, 919, 950, 951 rigidity, 467 risk, 1076, 1082–1084, 1086, 1087, 1090, 1091, 1093, 1095 Rissanen, J. J., 602, 896, 948, 950 Robbins, 41 robust regression, 1129 Roeper, P., 118 Romeijn, J. W., 8, 9, 46, 481 Rooij, S. de, 21–23, 46 Rosenkrantz, R., 335 Rosenthal, H., 487 Royall’s questions, 513 Royall, R., 3, 4, 10–12, 16, 19, 359,
361, 540 Royer-Collard, P. P., 1154 rule of succession, 455 Russell, B., 46, 1150 safety belt, 268, 280 Sakamoto, Y., 535n Salmon, W., 108 Samanta, T., 583, 588, 597 sample, 199–202, 205, 206, 209, 210, 216, 218, 222–224, 757 sample mean, 70, 80 proportion, 82 size, 80, 87, 501, 503 space, 55, 57, 72, 274, 510, 753–756, 758–761, 763, 764, 769–771 standard deviation, 84, 86 statistic, 199 variance, 68, 69 sampling, 199 distribution, 64, 80, 83, 84, 86, 294, 1027, 1041, 1053 sanity check, 870 satisfaction set, 310 Savage’s representation theorem, 259 Savage, L. J., 335, 344, 361, 373, 398, 442, 475, 1150, 1170 scale invariance, 516n Schervish, M. J., 117 Schnorr’s theorem, 647, 648, 688, 689, 696, 707 Schnorr, C. P., 689, 697, 698 Schulte, O., 859n Schwarz, G., 110, 536, 541, 597, 940 Scientific Inference, 1166 scientific communication, 283 data analysis, 299 evidence, 493 hypotheses, 556 method, 983 realism, 858
reporting, 303 theory, 983 scoring probabilities, 914 scoring rule, 398n Scott, D., 341, 460, 461, 463 Scozzafava, R., 443, 466 Scriven, M., 22 Searle, J. R., 956 second part, 939 second problem of the priors, 428 second-order effect, 1002 second-order evidential probability, 315 Seemann, T., 922 segmentation, 969 Seidenfeld, T., 117, 126, 308n, 449 selection bias, 827 self-ratification, 240 self-supporting, 241 semantic tableaux, 459 Sense and Sensibilia, 1169 sensitivity, 106, 497, 498 analysis, 275 to the prior, 610, 611 sequences, 642–646, 648–651, 653, 656, 658, 671, 683, 690, 692 sequential analysis, 525 clustering, 965 prediction strategy, 882 sampling, 525 severe test, definition, 164 severely tested, 984 severity, 4 and fallacies of rejection, 168 and testing model assumptions, 193 in the case of accepting the null, 177 in the case of rejecting the null, 169 vs. power, 172 severity principle weak vs. full, 162 severity rationale, 162
vs. behavioristic rationale, 163 Shafer–Dempster calculus, 407 Shannon entropy, 927 Shannon, C. E., 284, 680n, 875 Shao, J., 584 sharp, 1018 hypothesis testing, 300–302 shattering, 20, 851–853 Shibata, R., 584, 592 Shimony, A., 108 shooting contest, 993 shooting room paradox, 725, 726 short path problem, 1012 short run, 990 short simplicity paths, 1015 short-run truth-indication, 986 shrinkage, 933, 1082, 1084, 1093 Shtarkov, Y. M., 896 SIC, 601 Sierpiński, W., 658 Σ₁, 461 σ-algebra, 100 σ-field, 100 Σ⁰₁, 461 Σ⁰₂, 461 significance, 752, 757, 758n, 759, 769, 770 significance level, 258 significance test(s), 4, 90, 155, 199, 200, 204, 205, 208, 210–215, 217, 219, 220–225, 257, 499, 500, 553, 554 similarity, 480 simple, 1018 hypotheses, 560, 562 law, 985 models, 522 random sample, 71 theory, 989 simpler theory, 988 simplest UTM, 953 simplex, 1016 simplicity, 2, 17, 19, 20, 22–24, 858–860, 901, 983, 987
biases, 1020 degrees, 1012 Simpson’s paradox, 29, 31, 32, 93, 124, 310, 724, 726 Simpson, E. H., 124 skeptical attitude, 986 skeptical path(s), 1003, 1004 skewed, 66 skewness, 1034, 1061 Skyrms, B., 1n, 111n, 121, 123, 335, 344, 370, 480, 481, 481n, 483, 1169 Sleeping Beauty problem, 726, 728 Slutsky’s theorem, 1027, 1032–1034, 1045, 1062, 1063 Smith, A. F. M., 583 SMML, 934, 937 SMML code-book, 938 Smullyan, R., 459, 463 Snir, M., 373 Snob program, 965 Sober, E., 17, 18, 22, 38, 45, 110, 535n, 584, 589, 598, 599 Solomonoff, R. J., 36, 645n, 647, 648n, 675, 683, 866, 896, 909, 912, 930, 954 Solomonoff-Kolmogorov-Chaitin invariance theorem, 678 Solovay, R. M., 675 space shuttle Challenger, 60 Spanos, A., 4, 11, 16, 19, 45 sparsity, 602 spatial clustering, 965 spatial correlation, 1134, 1138 spatial-temporal correlation, 1126, 1138 spatial-temporal error, 1133 specific conditioning logic, 418 specificity, 313, 314, 497, 498 speed of light, 82 spending of test size, 525 Spiegelhalter, D. J., 594 Spirtes, P., 25, 26, 46, 109 square-root inverted gamma, 278 squared error, 991
Śrīdhara, 1176 Srinivasan, C., 38, 41, 46 SRM, 947 St. Petersburg paradox, 731 stability of knowledge, 1001 Stalnaker’s hypothesis, 128 Stalnaker, R. C., 128, 129 stalwartness, 1005 violators, 1010 standard deviation, 64, 68 standard error, 207, 224 standard normal distribution, 86 standardization, 820 state description(s), 396, 453–456 states, 237 statistic, 80 Statistical Methods and Scientific Inference, 1170 statistical consistency, 934, 935, 940, 945, 952, 963 evidence, 493 hypotheses, 517 inconsistency, 944, 947, 954, 957 inference, 2, 3, 62, 64, 80, 82, 270, 474, 493, 517, 1022 invariance, 933–935, 944, 945 learning theory, 19, 849, 850, 854, 858, 860, 946, 947 likelihood function, 939 link, 997 model, 154, 517 paradigm, 493 paradox(es), 29, 30, 737 techniques, 985 test, 234, 257 verisimilitude, 487 statistically consistent, 936, 945 inconsistent, 941 invariant, 932, 934, 936, 940, 942, 945 statistics, 493, 1021 steel post, 992
Steel, D., 20, 46 Stegmüller, W., 477, 478 Stein estimator, 41 Stein’s paradox, 3, 38, 39, 41, 290 Stein, C., 39 stem cells, 966 Sterzinger, O., 1149 stochasticity; stochastic, 672, 698, 702, 703 Church, 672, 701, 702 for finite strings, 681, 687 Kolmogorov-Loveland-, 701–703, 705 laws, see randomness Mises-Wald-Church, 672, 701–705 randomness, see randomness Stone, M., 588, 594 stopping rule(s), 434, 502, 519, 626 stopping rule principle, 187 stream of experience, 1004 strength, 314 of evidence, 82, 250, 254, 494, 520 strict coherence, 108 Strict Minimum Message Length (SMML), 936, 937–939 Strict Strict MML (SSMML), 938 strings, 643–649, 656–658, 664, 667–668, 675–689, 691–692 Strong Inference, 514 strong argument for Ockham’s razor, 1011 strong normality, 659–660 strongly analogical method, 480, 481 strongly beats, 1006, 1007 structural axioms, 239 structural constraints, 324 structural equations, 816 structural risk minimisation (SRM), 894, 996, 997n structure-description, 455, 456 structure of the decision problem, 269 Student, 206, 277, 294 Student’s t-test, 206
subjective and objective Bayes/Bayesian, 895, 988 Bayes compromise, 614 interpretation, 106 model weights, 614 prior, 517, 558 probability, 143, 144, 146, 235, 236 reversed senses, 1169, 1170 view, 637 subjectivists, 217, 635 subjectivity, 16 submodel, 584 sufficiency, 565, 566n, 567, 571, 575 sufficient state, 568 sufficient statistic(s), 19, 258, 275, 939, 1043, 1060 sufficient-component cause model, 828 summary statistics, 66 śūnyavāda and zeroism, 1188 supertask and convergence, 1185 defined, 1183 involved in any notion of limits, 1183 involved in representing a formal real number, 1187 made possible by set theory, 1183 not involved in Indian way to sum infinite series, 1187 related to neglect of small numbers, 1185 Suppes, P., 108 support curve, 523 support interval, 523 support vector machines (SVMs), 946, 947, 964 sure-thing principle, 269 Suśruta, 1176 swarm intelligence, 958 sweet spot, 993 switch code, 889 syādavāda, 1192
symmetry, 67, 996 System of Logic, 1159, 1163 systematic errors, 813 T-test, 205 TAIG, 940 tail area probability, 499 Taper, M., 14–16, 45 target, 991, 993 formula, 309 Tarski, A., 459, 461, 662 tautology, 56 Teng, C. M., 41, 46 test, 757, 758, 769–772, 984 of significance, 82, 87 statistic, 89, 257, 258, 300 statistic in testing, 156 testability, 849, 854, 860 testing, 82 hypothesis, 91 the value of a normal mean, 301 tests for randomness, 637, 638 The Civilization of the Renaissance in Italy, 1169 the 68-95-99.7 rule, 76 theoretical question, 1002 theoretical terms, 245 theoretical truth, 1000 Théorie analytique, 1160 theory of inductive probability, 476 thermometer, 985 three basic rules, 56 three key assumptions, 86 three measures of central tendency, 64 three standard deviations of the mean, 70 time’s arrow, 966, 969 TIP, 478, 480, 481, 485 topological structure, 1002 tracking, 1000 Traité du calcul des probabilités, 1164 transformation invariance, 516n transitivity, 405
of preference, 269 Treatise on Probability, 1171 trend detection, 1128 trends, 993 trial, 199 triviality results, 129 triviality theorem, 465 true and causal theory, 999 true theory(ies), 987, 991, 1021 truncated sequential design, 503 truth, 514, 519, 522–524, 985 finding, 869, 885 conducive, 987, 1002 conduciveness, 985, 986, 990, 991, 1000 indicator, 985 related virtues, 12 truthlikeness, 515 Tsao, C. A., 32, 34, 46 Turgot, A.-R.-J., 1153 Turing degrees, 697 equivalence, 697, 698 reducibility, 697 machine(s), 461, 873, 902, 903, 909–913, 939, 946, 955, 960, 967, 990 Turing, A., 658, 663 Tversky, A., 433 two envelope (or exchange) problem, 126, 732, 733, 734 two standard deviations of the mean, 70 two-part, 939 codes, 878 compression, 902, 957 construction, 912 file compression, 913 inference, 903 Kolmogorov complexity, 939 message, 902, 936, 948, 949 MML, 912 two-sided test, 90 two-tailed significance test, 211
Type I error, 90, 499, 506, 520, 526 Type II error, 90, 506, 520 typical performance, 994 typicality, 646–647, 673, 705–707 Über die Lehre des Wahrscheinlichen, 1158n Ulam, S., 645 unbiased estimate, 292 unbiased estimator, 17, 535, 536, 541 unbiased thermometer, 538 uncertain evidence, 431 uncertain inference, 91 uncertainty, 263–265, 269, 271 unconditional probability, 58, 99 undecidability, 960, 967 under- and overfitting, 868, 894, 993, 1102 under-aim, 993 under-determination of theory, 161, 515, 814 unified explanation, 984 uniform, 985 code, 867, 870 distribution, 1051 prior, 285, 932 uniformity of nature, 2, 996 uniformly effective open, see effective open uninformative priors, 1166n union, 55 uniquely decodable code, 874 uniqueness, 951 property, 916 property of log-loss scoring, 914 unit cube, 1018 unity, 984 universal bound, 501, 522 universal code, 869, 876, 877 for the integers, 872 universal comparability, 401, 403 universal generalisations, 430 universal Turing machine (UTM), 911, 912, 939, 948, 953, 961
unobserved common cause, 999 unpredictability, 646–647, 654, 657, 670, 700, 705–707 maximal, 647 unreliability of inference and statistical misspecifications, 190 unstable, 1009 Urbach, P., 335, 368, 450, 536 useful models, 987 utility, 238 function, 269, 560 maximisation, 240 of smallpox vaccination, 1152 or loss function, 558 vague probability, 113 valuation, 1018 van de Geer, S. A., 603 van Fraassen, B., 107, 113, 125, 131 Vapnik, V. N., 602, 858, 946 Vapnik-Chervonenkis (VC) dimension, 20, 849, 851–854, 856–860, 946, 947 variance, 64, 546, 992, 993, 1103 of a random variable, 74 vector of interest, 276 Venn, J., 43, 634, 1149, 1150, 1159, 1160, 1162, 1163, 1165, 1171 verifiability, 1001 verisimilitude, 473, 484–486, 515, 516 verisimilitudinarian BIL, 484, 485 Ville, J. A., 36, 672 Vineberg, S., 30–32, 46 violations of eventual informativeness, 1010 virtue, 968 virus, 266, 273 Voltaire (Arouet, F. M.), 43, 1153 von Mises circular distribution, 964, 965 von Mises, R., see Mises, R. von von Mises-Fisher spherical distribution, 964
von Neumann trick, 653, 670 von Neumann, J., see Neumann, J. von Waerden, B. L. van der, 655 Wald, A., 671 Wallace non-universality problem, 913 Wallace, C. S., 110, 121, 909, 930, 936 Walley, P., 113 wave theory of light, 1008 WCP, 567 weak conditionality principle, 563–565, 569, 571, 572n, 575, see also WCP weak evidence, 520, 526 weak prequential principle, 890 weak sufficiency principle, 564–566, 571, 572n, 575, see also WSP weakly analogical exchangeable methods, 481 weakly analogical method, 480 weakly beaten, 1006 Webb, G., 893 Wegener, A., 931 weighted average squared distance, 74 weighted averages and arbitrage in early India, 1178 and mathematical expectation, 1178 Weirich, P., 5, 11, 34, 45 welded rifle, 993 Weyl, H., 645n, 660 Wheeler, G., 10, 45 William of Ockham, 1170 Williamson, J., 10, 45, 335 Williamson, T., 128 wishful thinking, 1002 Woodward, J. F., 109 worst-case retraction cost, 1005 WSP, 567 yoga, 769 z-score, 201 z-test, 200, 201 Zabell, S. L., 43, 46, 122
zero-one law, 652, 654 zeroism, 1185 and śūnyavāda, 1188 and conditioned coorigination, 1188 and fear of small numbers, 1189 and impermanence, 1189 as a realist and fallibilist position, 1189 how does one represent a continually changing entity?, 1188 impossibility of representing a continually changing entity, 1188 non-representability changes arithmetic, 1188 non-representability of a formal real number, 1187